Merge remote-tracking branch 'upstream/develop' into develop

commit e91bff79f5 (pull/3006/head) by longrookie

@ -19,7 +19,7 @@ import subprocess
import platform
COPYRIGHT = '''
Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@ -178,7 +178,10 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural Language Processing (NLP) and Computer Vision (CV).
### Recent Update
- 🎉 2023.03.07: Add [TTS ARM Linux C++ Demo](./demos/TTSArmLinux).
- 🔥 2023.03.14: Add SVS (Singing Voice Synthesis) examples with the Opencpop dataset, including [DiffSinger](./examples/opencpop/svs1), [PWGAN](./examples/opencpop/voc1) and [HiFiGAN](./examples/opencpop/voc5); synthesis quality is being continuously improved.
- 👑 2023.03.09: Add [Wav2vec2ASR-zh](./examples/aishell/asr3).
- 🎉 2023.03.07: Add [TTS ARM Linux C++ Demo (with C++ Chinese Text Frontend)](./demos/TTSArmLinux).
- 🔥 2023.03.03: Add Voice Conversion [StarGANv2-VC synthesis pipeline](./examples/vctk/vc3).
- 🎉 2023.02.16: Add [Cantonese TTS](./examples/canton/tts3).
- 🔥 2023.01.10: Add [code-switching ASR CLI and demos](./demos/speech_recognition).
- 👑 2023.01.06: Add [code-switching ASR tal_cs recipe](./examples/tal_cs/asr1/).
@ -575,14 +578,14 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</thead>
<tbody>
<tr>
<td> Text Frontend </td>
<td colspan="2"> &emsp; </td>
<td>
<a href = "./examples/other/tn">tn</a> / <a href = "./examples/other/g2p">g2p</a>
</td>
<td> Text Frontend </td>
<td colspan="2"> &emsp; </td>
<td>
<a href = "./examples/other/tn">tn</a> / <a href = "./examples/other/g2p">g2p</a>
</td>
</tr>
<tr>
<td rowspan="5">Acoustic Model</td>
<td rowspan="6">Acoustic Model</td>
<td>Tacotron2</td>
<td>LJSpeech / CSMSC</td>
<td>
@ -617,6 +620,13 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
</td>
</tr>
<tr>
<td>DiffSinger</td>
<td>Opencpop</td>
<td>
<a href = "./examples/opencpop/svs1">DiffSinger-opencpop</a>
</td>
</tr>
<tr>
<td rowspan="6">Vocoder</td>
<td >WaveFlow</td>
@ -627,9 +637,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tr>
<tr>
<td >Parallel WaveGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a> / <a href = "./examples/opencpop/voc1">PWGAN-opencpop</a>
</td>
</tr>
<tr>
@ -648,9 +658,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tr>
<tr>
<td>HiFiGAN</td>
<td>LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a> / <a href = "./examples/opencpop/voc5">HiFiGAN-opencpop</a>
</td>
</tr>
<tr>

@ -183,7 +183,10 @@
- 🧩 Cascaded models application: as an extension of traditional speech tasks, we combine them with Natural Language Processing, Computer Vision, and other fields to deliver industrial-grade applications closer to real-world needs.
### Recent Update
- 🎉 2023.03.07: Add [TTS ARM Linux C++ deployment demo](./demos/TTSArmLinux).
- 🔥 2023.03.14: Add SVS (Singing Voice Synthesis) examples based on the Opencpop dataset, including [DiffSinger](./examples/opencpop/svs1), [PWGAN](./examples/opencpop/voc1) and [HiFiGAN](./examples/opencpop/voc5); synthesis quality is being continuously improved.
- 👑 2023.03.09: Add [Wav2vec2ASR-zh](./examples/aishell/asr3).
- 🎉 2023.03.07: Add [TTS ARM Linux C++ deployment demo (with C++ Chinese text frontend module)](./demos/TTSArmLinux).
- 🔥 2023.03.03: Add Voice Conversion [StarGANv2-VC synthesis pipeline](./examples/vctk/vc3).
- 🎉 2023.02.16: Add [Cantonese TTS](./examples/canton/tts3).
- 🔥 2023.01.10: Add [code-switching ASR CLI and demos](./demos/speech_recognition).
- 👑 2023.01.06: Add [code-switching ASR tal_cs training/inference recipe](./examples/tal_cs/asr1/).
@ -574,43 +577,50 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
<td>
<a href = "./examples/other/tn">tn</a> / <a href = "./examples/other/g2p">g2p</a>
</td>
</tr>
<tr>
<td rowspan="5">声学模型</td>
</tr>
<tr>
<td rowspan="6">声学模型</td>
<td>Tacotron2</td>
<td>LJSpeech / CSMSC</td>
<td>
<a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a> / <a href = "./examples/csmsc/tts0">tacotron2-csmsc</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td>Transformer TTS</td>
<td>LJSpeech</td>
<td>
<a href = "./examples/ljspeech/tts1">transformer-ljspeech</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td>SpeedySpeech</td>
<td>CSMSC</td>
<td >
<a href = "./examples/csmsc/tts2">speedyspeech-csmsc</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td>FastSpeech2</td>
<td>LJSpeech / VCTK / CSMSC / AISHELL-3 / ZH_EN / finetune</td>
<td>
<a href = "./examples/ljspeech/tts3">fastspeech2-ljspeech</a> / <a href = "./examples/vctk/tts3">fastspeech2-vctk</a> / <a href = "./examples/csmsc/tts3">fastspeech2-csmsc</a> / <a href = "./examples/aishell3/tts3">fastspeech2-aishell3</a> / <a href = "./examples/zh_en_tts/tts3">fastspeech2-zh_en</a> / <a href = "./examples/other/tts_finetune/tts3">fastspeech2-finetune</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
<td>VCTK / AISHELL-3 / ZH_EN</td>
<td>
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
</td>
</tr>
</tr>
<tr>
<td>DiffSinger</td>
<td>Opencpop</td>
<td>
<a href = "./examples/opencpop/svs1">DiffSinger-opencpop</a>
</td>
</tr>
<tr>
<td rowspan="6">声码器</td>
<td >WaveFlow</td>
@ -621,9 +631,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tr>
<tr>
<td >Parallel WaveGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a> / <a href = "./examples/opencpop/voc1">PWGAN-opencpop</a>
</td>
</tr>
<tr>
@ -642,9 +652,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tr>
<tr>
<td >HiFiGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a> / <a href = "./examples/opencpop/voc5">HiFiGAN-opencpop</a>
</td>
</tr>
<tr>
@ -701,6 +711,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tbody>
</table>
<a name="声音分类模型"></a>
**Sound Classification**

@ -1,4 +1,8 @@
# directories
build/
output/
libs/
models/
# symlink
dict

@ -10,9 +10,9 @@
### Install dependencies
```
```bash
# Ubuntu
sudo apt install build-essential cmake wget tar unzip
sudo apt install build-essential cmake pkg-config wget tar unzip
# CentOS
sudo yum groupinstall "Development Tools"
@ -25,15 +25,13 @@ sudo yum install cmake wget tar unzip
Download it with the following commands:
```
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech/demos/TTSArmLinux
```bash
./download.sh
```
### Build the Demo
```
```bash
./build.sh
```
@ -43,12 +41,18 @@ cd PaddleSpeech/demos/TTSArmLinux
### Run
```
You can change the `--phone2id_path` parameter in `./front.conf` to point to the `phone_id_map.txt` of your own acoustic model.
```bash
./run.sh
./run.sh --sentence "语音合成测试"
./run.sh --sentence "输出到指定的音频文件" --output_wav ./output/test.wav
./run.sh --help
```
This converts the ten sentences defined in the `sentencesToChoose` array in [src/main.cpp](src/main.cpp) into `wav` files saved in the `output` folder.
Currently only Chinese synthesis is supported; any English input will crash the program.
If `--output_wav` is not specified, the output goes to `./output/tts.wav` by default.
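For example, a minimal sketch of pointing the frontend at your own model's phoneme table (the checkpoint directory `./dict/my_am_ckpt/` is hypothetical):
```bash
# Rewrite the --phone2id_path entry of front.conf in place (GNU sed)
sed -i 's|^--phone2id_path=.*|--phone2id_path=./dict/my_am_ckpt/phone_id_map.txt|' ./front.conf
```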
## Manually build the Paddle Lite library

@ -0,0 +1 @@
src/TTSCppFrontend/build-depends.sh

@ -1,8 +1,11 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
BASE_DIR="$PWD"
# load config
. ./config.sh
@ -10,11 +13,17 @@ cd "$(dirname "$(realpath "$0")")"
echo "ARM_ABI is ${ARM_ABI}"
echo "PADDLE_LITE_DIR is ${PADDLE_LITE_DIR}"
rm -rf build
mkdir -p build
cd build
echo "Build depends..."
./build-depends.sh "$@"
mkdir -p "$BASE_DIR/build"
cd "$BASE_DIR/build"
cmake -DPADDLE_LITE_DIR="${PADDLE_LITE_DIR}" -DARM_ABI="${ARM_ABI}" ../src
make
if [ "$*" = "" ]; then
make -j$(nproc)
else
make "$@"
fi
echo "make successful!"

@ -1,8 +1,11 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
BASE_DIR="$PWD"
# load config
. ./config.sh
@ -12,3 +15,9 @@ set -x
rm -rf "$OUTPUT_DIR"
rm -rf "$LIBS_DIR"
rm -rf "$MODELS_DIR"
rm -rf "$BASE_DIR/build"
"$BASE_DIR/src/TTSCppFrontend/clean.sh"
# remove the symlink
rm "$BASE_DIR/dict"

@ -10,5 +10,6 @@ OUTPUT_DIR="${PWD}/output"
PADDLE_LITE_DIR="${LIBS_DIR}/inference_lite_lib.armlinux.${ARM_ABI}.gcc.with_extra.with_cv/cxx"
#PADDLE_LITE_DIR="/path/to/Paddle-Lite/build.lite.linux.${ARM_ABI}.gcc/inference_lite_lib.armlinux.${ARM_ABI}/cxx"
AM_MODEL_PATH="${MODELS_DIR}/cpu/fastspeech2_csmsc_arm.nb"
VOC_MODEL_PATH="${MODELS_DIR}/cpu/mb_melgan_csmsc_arm.nb"
ACOUSTIC_MODEL_PATH="${MODELS_DIR}/cpu/fastspeech2_csmsc_arm.nb"
VOCODER_PATH="${MODELS_DIR}/cpu/mb_melgan_csmsc_arm.nb"
FRONT_CONF="${PWD}/front.conf"

@ -3,6 +3,8 @@ set -e
cd "$(dirname "$(realpath "$0")")"
BASE_DIR="$PWD"
# load config
. ./config.sh
@ -38,6 +40,10 @@ download() {
echo '======================='
}
########################################
echo "Download models..."
download 'inference_lite_lib.armlinux.armv8.gcc.with_extra.with_cv.tar.gz' \
'https://paddlespeech.bj.bcebos.com/demos/TTSArmLinux/inference_lite_lib.armlinux.armv8.gcc.with_extra.with_cv.tar.gz' \
'39e0c6604f97c70f5d13c573d7e709b9' \
@ -54,3 +60,11 @@ download 'fs2cnn_mbmelgan_cpu_v1.3.0.tar.gz' \
"$MODELS_DIR"
echo "Done."
########################################
echo "Download dictionary files..."
ln -s src/TTSCppFrontend/front_demo/dict "$BASE_DIR/"
"$BASE_DIR/src/TTSCppFrontend/download.sh"

@ -0,0 +1,21 @@
# jieba conf
--jieba_dict_path=./dict/jieba/jieba.dict.utf8
--jieba_hmm_path=./dict/jieba/hmm_model.utf8
--jieba_user_dict_path=./dict/jieba/user.dict.utf8
--jieba_idf_path=./dict/jieba/idf.utf8
--jieba_stop_word_path=./dict/jieba/stop_words.utf8
# dict conf fastspeech2_0.4
--seperate_tone=false
--word2phone_path=./dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
--phone2id_path=./dict/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
--tone2id_path=./dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
# dict conf speedyspeech_0.5
#--seperate_tone=true
#--word2phone_path=./dict/speedyspeech_nosil_baker_ckpt_0.5/word2phone.dict
#--phone2id_path=./dict/speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt
#--tone2id_path=./dict/speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
# dict of tranditional_to_simplified
--trand2simpd_path=./dict/tranditional_to_simplified/trand2simp.txt

@ -7,12 +7,13 @@ cd "$(dirname "$(realpath "$0")")"
. ./config.sh
# create dir
rm -rf "$OUTPUT_DIR"
mkdir -p "$OUTPUT_DIR"
# run
for i in {1..10}; do
(set -x; ./build/paddlespeech_tts_demo "$AM_MODEL_PATH" "$VOC_MODEL_PATH" $i "$OUTPUT_DIR/$i.wav")
done
ls -lh "$OUTPUT_DIR"/*.wav
set -x
./build/paddlespeech_tts_demo \
--front_conf "$FRONT_CONF" \
--acoustic_model "$ACOUSTIC_MODEL_PATH" \
--vocoder "$VOCODER_PATH" \
"$@"
# end

@ -1,4 +1,18 @@
cmake_minimum_required(VERSION 3.10)
project(paddlespeech_tts_demo)
########## Global Options ##########
option(WITH_FRONT_DEMO "Build front demo" OFF)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(ABSL_PROPAGATE_CXX_STD ON)
########## ARM Options ##########
set(CMAKE_SYSTEM_NAME Linux)
if(ARM_ABI STREQUAL "armv8")
set(CMAKE_SYSTEM_PROCESSOR aarch64)
@ -13,14 +27,16 @@ else()
return()
endif()
project(paddlespeech_tts_demo)
########## Paddle Lite Options ##########
message(STATUS "TARGET ARCH ABI: ${ARM_ABI}")
message(STATUS "PADDLE LITE DIR: ${PADDLE_LITE_DIR}")
include_directories(${PADDLE_LITE_DIR}/include)
link_directories(${PADDLE_LITE_DIR}/libs/${ARM_ABI})
link_directories(${PADDLE_LITE_DIR}/lib)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")
if(ARM_ABI STREQUAL "armv8")
set(CMAKE_CXX_FLAGS "-march=armv8-a ${CMAKE_CXX_FLAGS}")
set(CMAKE_C_FLAGS "-march=armv8-a ${CMAKE_C_FLAGS}")
@ -29,6 +45,9 @@ elseif(ARM_ABI STREQUAL "armv7hf")
set(CMAKE_C_FLAGS "-march=armv7-a -mfloat-abi=hard -mfpu=neon-vfpv4 ${CMAKE_C_FLAGS}" )
endif()
########## Dependencies ##########
find_package(OpenMP REQUIRED)
if(OpenMP_FOUND OR OpenMP_CXX_FOUND)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
@ -43,5 +62,19 @@ else()
return()
endif()
############### tts cpp frontend ###############
add_subdirectory(TTSCppFrontend)
include_directories(
TTSCppFrontend/src
third-party/build/src/cppjieba/include
third-party/build/src/limonp/include
)
############### paddlespeech_tts_demo ###############
add_executable(paddlespeech_tts_demo main.cc)
target_link_libraries(paddlespeech_tts_demo paddle_light_api_shared)
target_link_libraries(paddlespeech_tts_demo paddle_light_api_shared paddlespeech_tts_front)

@ -1,7 +1,20 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
@ -9,32 +22,84 @@
using namespace paddle::lite_api;
typedef int16_t WavDataType;
class PredictorInterface {
public:
virtual ~PredictorInterface() = 0;
virtual bool Init(const std::string &AcousticModelPath,
const std::string &VocoderPath,
PowerMode cpuPowerMode,
int cpuThreadNum,
// The WAV sample rate must match the model output.
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
uint32_t wavSampleRate) = 0;
virtual std::shared_ptr<PaddlePredictor> LoadModel(
const std::string &modelPath,
int cpuThreadNum,
PowerMode cpuPowerMode) = 0;
virtual void ReleaseModel() = 0;
virtual bool RunModel(const std::vector<int64_t> &phones) = 0;
virtual std::unique_ptr<const Tensor> GetAcousticModelOutput(
const std::vector<int64_t> &phones) = 0;
virtual std::unique_ptr<const Tensor> GetVocoderOutput(
std::unique_ptr<const Tensor> &&amOutput) = 0;
virtual void VocoderOutputToWav(
std::unique_ptr<const Tensor> &&vocOutput) = 0;
virtual void SaveFloatWav(float *floatWav, int64_t size) = 0;
virtual bool IsLoaded() = 0;
virtual float GetInferenceTime() = 0;
virtual int GetWavSize() = 0;
// Get the WAV duration in milliseconds
virtual float GetWavDuration() = 0;
// Get the RTF (synthesis time / audio duration)
virtual float GetRTF() = 0;
virtual void ReleaseWav() = 0;
virtual bool WriteWavToFile(const std::string &wavPath) = 0;
};
class Predictor {
public:
bool Init(const std::string &AMModelPath, const std::string &VOCModelPath, int cpuThreadNum, const std::string &cpuPowerMode) {
PredictorInterface::~PredictorInterface() {}
// WavDataType: the WAV sample data type.
// Switch between int16_t and float to produce
// WAVs in 16-bit PCM or 32-bit IEEE float format.
template <typename WavDataType>
class Predictor : public PredictorInterface {
public:
bool Init(const std::string &AcousticModelPath,
const std::string &VocoderPath,
PowerMode cpuPowerMode,
int cpuThreadNum,
// The WAV sample rate must match the model output.
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
uint32_t wavSampleRate) override {
// Release model if exists
ReleaseModel();
AM_predictor_ = LoadModel(AMModelPath, cpuThreadNum, cpuPowerMode);
if (AM_predictor_ == nullptr) {
acoustic_model_predictor_ =
LoadModel(AcousticModelPath, cpuThreadNum, cpuPowerMode);
if (acoustic_model_predictor_ == nullptr) {
return false;
}
VOC_predictor_ = LoadModel(VOCModelPath, cpuThreadNum, cpuPowerMode);
if (VOC_predictor_ == nullptr) {
vocoder_predictor_ = LoadModel(VocoderPath, cpuThreadNum, cpuPowerMode);
if (vocoder_predictor_ == nullptr) {
return false;
}
wav_sample_rate_ = wavSampleRate;
return true;
}
~Predictor() {
virtual ~Predictor() {
ReleaseModel();
ReleaseWav();
}
std::shared_ptr<PaddlePredictor> LoadModel(const std::string &modelPath, int cpuThreadNum, const std::string &cpuPowerMode) {
std::shared_ptr<PaddlePredictor> LoadModel(
const std::string &modelPath,
int cpuThreadNum,
PowerMode cpuPowerMode) override {
if (modelPath.empty()) {
return nullptr;
}
@ -43,33 +108,17 @@ public:
MobileConfig config;
config.set_model_from_file(modelPath);
config.set_threads(cpuThreadNum);
if (cpuPowerMode == "LITE_POWER_HIGH") {
config.set_power_mode(PowerMode::LITE_POWER_HIGH);
} else if (cpuPowerMode == "LITE_POWER_LOW") {
config.set_power_mode(PowerMode::LITE_POWER_LOW);
} else if (cpuPowerMode == "LITE_POWER_FULL") {
config.set_power_mode(PowerMode::LITE_POWER_FULL);
} else if (cpuPowerMode == "LITE_POWER_NO_BIND") {
config.set_power_mode(PowerMode::LITE_POWER_NO_BIND);
} else if (cpuPowerMode == "LITE_POWER_RAND_HIGH") {
config.set_power_mode(PowerMode::LITE_POWER_RAND_HIGH);
} else if (cpuPowerMode == "LITE_POWER_RAND_LOW") {
config.set_power_mode(PowerMode::LITE_POWER_RAND_LOW);
} else {
std::cerr << "Unknown cpu power mode!" << std::endl;
return nullptr;
}
config.set_power_mode(cpuPowerMode);
return CreatePaddlePredictor<MobileConfig>(config);
}
void ReleaseModel() {
AM_predictor_ = nullptr;
VOC_predictor_ = nullptr;
void ReleaseModel() override {
acoustic_model_predictor_ = nullptr;
vocoder_predictor_ = nullptr;
}
bool RunModel(const std::vector<int64_t> &phones) {
bool RunModel(const std::vector<int64_t> &phones) override {
if (!IsLoaded()) {
return false;
}
@ -78,28 +127,29 @@ public:
auto start = std::chrono::system_clock::now();
// Run inference
VOCOutputToWav(GetAMOutput(phones));
VocoderOutputToWav(GetVocoderOutput(GetAcousticModelOutput(phones)));
// Stop timing
auto end = std::chrono::system_clock::now();
// Compute the elapsed time
std::chrono::duration<float> duration = end - start;
inference_time_ = duration.count() * 1000;  // in milliseconds
inference_time_ = duration.count() * 1000;  // in milliseconds
return true;
}
std::unique_ptr<const Tensor> GetAMOutput(const std::vector<int64_t> &phones) {
auto phones_handle = AM_predictor_->GetInput(0);
std::unique_ptr<const Tensor> GetAcousticModelOutput(
const std::vector<int64_t> &phones) override {
auto phones_handle = acoustic_model_predictor_->GetInput(0);
phones_handle->Resize({static_cast<int64_t>(phones.size())});
phones_handle->CopyFromCpu(phones.data());
AM_predictor_->Run();
acoustic_model_predictor_->Run();
// Get the output tensor
auto am_output_handle = AM_predictor_->GetOutput(0);
auto am_output_handle = acoustic_model_predictor_->GetOutput(0);
// Print the shape of the output tensor
std::cout << "AM Output shape: ";
std::cout << "Acoustic Model Output shape: ";
auto shape = am_output_handle->shape();
for (auto s : shape) {
std::cout << s << ", ";
@ -109,75 +159,91 @@ public:
return am_output_handle;
}
void VOCOutputToWav(std::unique_ptr<const Tensor> &&input) {
auto mel_handle = VOC_predictor_->GetInput(0);
std::unique_ptr<const Tensor> GetVocoderOutput(
std::unique_ptr<const Tensor> &&amOutput) override {
auto mel_handle = vocoder_predictor_->GetInput(0);
// [?, 80]
auto dims = input->shape();
auto dims = amOutput->shape();
mel_handle->Resize(dims);
auto am_output_data = input->mutable_data<float>();
auto am_output_data = amOutput->mutable_data<float>();
mel_handle->CopyFromCpu(am_output_data);
VOC_predictor_->Run();
vocoder_predictor_->Run();
// Get the output tensor
auto voc_output_handle = VOC_predictor_->GetOutput(0);
auto voc_output_handle = vocoder_predictor_->GetOutput(0);
// Print the shape of the output tensor
std::cout << "VOC Output shape: ";
std::cout << "Vocoder Output shape: ";
auto shape = voc_output_handle->shape();
for (auto s : shape) {
std::cout << s << ", ";
}
std::cout << std::endl;
return voc_output_handle;
}
void VocoderOutputToWav(
std::unique_ptr<const Tensor> &&vocOutput) override {
// Get the data of the output tensor
int64_t output_size = 1;
for (auto dim : voc_output_handle->shape()) {
for (auto dim : vocOutput->shape()) {
output_size *= dim;
}
auto output_data = voc_output_handle->mutable_data<float>();
auto output_data = vocOutput->mutable_data<float>();
SaveFloatWav(output_data, output_size);
}
inline float Abs(float number) {
return (number < 0) ? -number : number;
}
void SaveFloatWav(float *floatWav, int64_t size) override;
void SaveFloatWav(float *floatWav, int64_t size) {
wav_.resize(size);
float maxSample = 0.01;
// Find the maximum sample value
for (int64_t i=0; i<size; i++) {
float sample = Abs(floatWav[i]);
if (sample > maxSample) {
maxSample = sample;
}
}
// Scale samples into the int16_t range
for (int64_t i=0; i<size; i++) {
wav_[i] = floatWav[i] * 32767.0f / maxSample;
}
bool IsLoaded() override {
return acoustic_model_predictor_ != nullptr &&
vocoder_predictor_ != nullptr;
}
bool IsLoaded() {
return AM_predictor_ != nullptr && VOC_predictor_ != nullptr;
}
float GetInferenceTime() override { return inference_time_; }
float GetInferenceTime() {
return inference_time_;
}
const std::vector<WavDataType> &GetWav() { return wav_; }
const std::vector<WavDataType> & GetWav() {
return wav_;
}
int GetWavSize() override { return wav_.size() * sizeof(WavDataType); }
int GetWavSize() {
return wav_.size() * sizeof(WavDataType);
// Get the WAV duration in milliseconds
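// e.g. 96000 bytes of int16_t samples at 24 kHz: 96000 / 2 / 24000 * 1000 = 2000 ms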
float GetWavDuration() override {
return static_cast<float>(GetWavSize()) / sizeof(WavDataType) /
static_cast<float>(wav_sample_rate_) * 1000;
}
void ReleaseWav() {
wav_.clear();
// Get the RTF (synthesis time / audio duration)
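// RTF < 1 means synthesis runs faster than real time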
float GetRTF() override { return GetInferenceTime() / GetWavDuration(); }
void ReleaseWav() override { wav_.clear(); }
bool WriteWavToFile(const std::string &wavPath) override {
std::ofstream fout(wavPath, std::ios::binary);
if (!fout.is_open()) {
return false;
}
// Write the header
WavHeader header;
header.audio_format = GetWavAudioFormat();
header.data_size = GetWavSize();
header.size = sizeof(header) - 8 + header.data_size;
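// The RIFF chunk size covers everything after the 8-byte 'RIFF' tag and size field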
header.sample_rate = wav_sample_rate_;
header.byte_rate = header.sample_rate * header.num_channels *
header.bits_per_sample / 8;
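// e.g. 24000 Hz * 1 channel * 16 bits / 8 = 48000 bytes per second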
header.block_align = header.num_channels * header.bits_per_sample / 8;
fout.write(reinterpret_cast<const char *>(&header), sizeof(header));
// Write the WAV data
fout.write(reinterpret_cast<const char *>(wav_.data()),
header.data_size);
fout.close();
return true;
}
protected:
struct WavHeader {
// RIFF header
char riff[4] = {'R', 'I', 'F', 'F'};
@ -187,15 +253,11 @@ public:
// FMT header
char fmt[4] = {'f', 'm', 't', ' '};
uint32_t fmt_size = 16;
uint16_t audio_format = 1;  // 1 = integer PCM, 3 = IEEE float
uint16_t audio_format = 0;
uint16_t num_channels = 1;
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
uint32_t sample_rate = 24000;
uint32_t byte_rate = 64000;
uint16_t block_align = 2;
uint32_t sample_rate = 0;
uint32_t byte_rate = 0;
uint16_t block_align = 0;
uint16_t bits_per_sample = sizeof(WavDataType) * 8;
// DATA header
@ -203,30 +265,56 @@ public:
uint32_t data_size = 0;
};
bool WriteWavToFile(const std::string &wavPath) {
std::ofstream fout(wavPath, std::ios::binary);
if (!fout.is_open()) {
return false;
}
// Write the header
WavHeader header;
header.data_size = GetWavSize();
header.size = sizeof(header) - 8 + header.data_size;
header.byte_rate = header.sample_rate * header.num_channels * header.bits_per_sample / 8;
header.block_align = header.num_channels * header.bits_per_sample / 8;
fout.write(reinterpret_cast<const char*>(&header), sizeof(header));
enum WavAudioFormat {
WAV_FORMAT_16BIT_PCM = 1,   // 16-bit PCM format
WAV_FORMAT_32BIT_FLOAT = 3  // 32-bit IEEE float format
};
// Write the WAV data
fout.write(reinterpret_cast<const char*>(wav_.data()), header.data_size);
protected:
// The return value is selected by template specialization on WavDataType
inline uint16_t GetWavAudioFormat();
fout.close();
return true;
}
inline float Abs(float number) { return (number < 0) ? -number : number; }
private:
protected:
float inference_time_ = 0;
std::shared_ptr<PaddlePredictor> AM_predictor_ = nullptr;
std::shared_ptr<PaddlePredictor> VOC_predictor_ = nullptr;
uint32_t wav_sample_rate_ = 0;
std::vector<WavDataType> wav_;
std::shared_ptr<PaddlePredictor> acoustic_model_predictor_ = nullptr;
std::shared_ptr<PaddlePredictor> vocoder_predictor_ = nullptr;
};
template <>
uint16_t Predictor<int16_t>::GetWavAudioFormat() {
return Predictor::WAV_FORMAT_16BIT_PCM;
}
template <>
uint16_t Predictor<float>::GetWavAudioFormat() {
return Predictor::WAV_FORMAT_32BIT_FLOAT;
}
// Save the WAV as 16-bit PCM
template <>
void Predictor<int16_t>::SaveFloatWav(float *floatWav, int64_t size) {
wav_.resize(size);
float maxSample = 0.01;
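// The 0.01 floor avoids division by zero and over-amplifying near-silent audio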
// Find the maximum sample value
for (int64_t i = 0; i < size; i++) {
float sample = Abs(floatWav[i]);
if (sample > maxSample) {
maxSample = sample;
}
}
// Scale samples into the int16_t range
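// (peak normalization: the loudest sample is mapped to full scale, ±32767)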
for (int64_t i = 0; i < size; i++) {
wav_[i] = floatWav[i] * 32767.0f / maxSample;
}
}
// Save the WAV as 32-bit IEEE float
template <>
void Predictor<float>::SaveFloatWav(float *floatWav, int64_t size) {
wav_.resize(size);
std::copy_n(floatWav, size, wav_.data());
}

@ -0,0 +1 @@
../../TTSCppFrontend/

@ -1,72 +1,162 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <front/front_interface.h>
#include <gflags/gflags.h>
#include <glog/logging.h>
#include <paddle_api.h>
#include <cstdlib>
#include <iostream>
#include <map>
#include <memory>
#include "paddle_api.h"
#include <string>
#include "Predictor.hpp"
using namespace paddle::lite_api;
std::vector<std::vector<int64_t>> sentencesToChoose = {
// 009901 昨日,这名“伤者”与医生全部被警方依法刑事拘留。
{261, 231, 175, 116, 179, 262, 44, 154, 126, 177, 19, 262, 42, 241, 72, 177, 56, 174, 245, 37, 186, 37, 49, 151, 127, 69, 19, 179, 72, 69, 4, 260, 126, 177, 116, 151, 239, 153, 141},
// 009902 钱伟长想到上海来办学校是经过深思熟虑的。
{174, 83, 213, 39, 20, 260, 89, 40, 30, 177, 22, 71, 9, 153, 8, 37, 17, 260, 251, 260, 99, 179, 177, 116, 151, 125, 70, 233, 177, 51, 176, 108, 177, 184, 153, 242, 40, 45},
// 009903 她见我一进门就骂,吃饭时也骂,骂得我抬不起头。
{182, 2, 151, 85, 232, 73, 151, 123, 154, 52, 151, 143, 154, 5, 179, 39, 113, 69, 17, 177, 114, 105, 154, 5, 179, 154, 5, 40, 45, 232, 182, 8, 37, 186, 174, 74, 182, 168},
// 009904 李述德在离开之前,只说了一句“柱驼杀父亲了”。
{153, 74, 177, 186, 40, 42, 261, 10, 153, 73, 152, 7, 262, 113, 174, 83, 179, 262, 115, 177, 230, 153, 45, 73, 151, 242, 180, 262, 186, 182, 231, 177, 2, 69, 186, 174, 124, 153, 45},
// 009905 这种车票和保险单捆绑出售属于重复性购买。
{262, 44, 262, 163, 39, 41, 173, 99, 71, 42, 37, 28, 260, 84, 40, 14, 179, 152, 220, 37, 21, 39, 183, 177, 170, 179, 177, 185, 240, 39, 162, 69, 186, 260, 128, 70, 170, 154, 9},
// 009906 戴佩妮的男友西米露接唱情歌,让她非常开心。
{40, 10, 173, 49, 155, 72, 40, 45, 155, 15, 142, 260, 72, 154, 74, 153, 186, 179, 151, 103, 39, 22, 174, 126, 70, 41, 179, 175, 22, 182, 2, 69, 46, 39, 20, 152, 7, 260, 120},
// 009907 观大势、谋大局、出大策始终是该院的办院方针。
{70, 199, 40, 5, 177, 116, 154, 168, 40, 5, 151, 240, 179, 39, 183, 40, 5, 38, 44, 179, 177, 115, 262, 161, 177, 116, 70, 7, 247, 40, 45, 37, 17, 247, 69, 19, 262, 51},
// 009908 他们骑着摩托回家,正好为农忙时的父母帮忙。
{182, 2, 154, 55, 174, 73, 262, 45, 154, 157, 182, 230, 71, 212, 151, 77, 180, 262, 59, 71, 29, 214, 155, 162, 154, 20, 177, 114, 40, 45, 69, 186, 154, 185, 37, 19, 154, 20},
// 009909 但是因为还没到退休年龄,只能掰着指头捱日子。
{40, 17, 177, 116, 120, 214, 71, 8, 154, 47, 40, 30, 182, 214, 260, 140, 155, 83, 153, 126, 180, 262, 115, 155, 57, 37, 7, 262, 45, 262, 115, 182, 171, 8, 175, 116, 261, 112},
// 009910 这几天雨水不断,人们恨不得待在家里不出门。
{262, 44, 151, 74, 182, 82, 240, 177, 213, 37, 184, 40, 202, 180, 175, 52, 154, 55, 71, 54, 37, 186, 40, 42, 40, 7, 261, 10, 151, 77, 153, 74, 37, 186, 39, 183, 154, 52},
};
void usage(const char *binName) {
std::cerr << "Usage:" << std::endl
<< "\t" << binName << " <AM-model-path> <VOC-model-path> <sentences-index:1-10> <output-wav-path>" << std::endl;
}
DEFINE_string(
sentence,
"你好,欢迎使用语音合成服务",
"Text to be synthesized (Chinese only. English will crash the program.)");
DEFINE_string(front_conf, "./front.conf", "Front configuration file");
DEFINE_string(acoustic_model,
"./models/cpu/fastspeech2_csmsc_arm.nb",
"Acoustic model .nb file");
DEFINE_string(vocoder,
              "./models/cpu/mb_melgan_csmsc_arm.nb",
              "Vocoder .nb file");
DEFINE_string(output_wav, "./output/tts.wav", "Output WAV file");
DEFINE_string(wav_bit_depth,
"16",
"WAV bit depth, 16 (16-bit PCM) or 32 (32-bit IEEE float)");
DEFINE_string(wav_sample_rate,
"24000",
"WAV sample rate, should match the output of the vocoder");
DEFINE_string(cpu_thread, "1", "Number of CPU threads");
int main(int argc, char *argv[]) {
if (argc < 5) {
usage(argv[0]);
gflags::ParseCommandLineFlags(&argc, &argv, true);
PredictorInterface *predictor;
if (FLAGS_wav_bit_depth == "16") {
predictor = new Predictor<int16_t>();
} else if (FLAGS_wav_bit_depth == "32") {
predictor = new Predictor<float>();
} else {
LOG(ERROR) << "Unsupported WAV bit depth: " << FLAGS_wav_bit_depth;
return -1;
}
const char *AMModelPath = argv[1];
const char *VOCModelPath = argv[2];
int sentencesIndex = atoi(argv[3]) - 1;
const char *outputWavPath = argv[4];
if (sentencesIndex < 0 || sentencesIndex >= sentencesToChoose.size()) {
std::cerr << "sentences-index out of range" << std::endl;
/////////////////////////// Frontend: text to phonemes ///////////////////////////
// Instantiate the text frontend engine
ppspeech::FrontEngineInterface *front_inst = nullptr;
front_inst = new ppspeech::FrontEngineInterface(FLAGS_front_conf);
if ((!front_inst) || (front_inst->init())) {
LOG(ERROR) << "Creater tts engine failed!";
if (front_inst != nullptr) {
delete front_inst;
}
front_inst = nullptr;
return -1;
}
Predictor predictor;
if (!predictor.Init(AMModelPath, VOCModelPath, 1, "LITE_POWER_HIGH")) {
std::cerr << "predictor init failed" << std::endl;
std::wstring ws_sentence = ppspeech::utf8string2wstring(FLAGS_sentence);
// Convert traditional to simplified Chinese
std::wstring sentence_simp;
front_inst->Trand2Simp(ws_sentence, &sentence_simp);
ws_sentence = sentence_simp;
std::string s_sentence;
std::vector<std::wstring> sentence_part;
std::vector<int> phoneids = {};
std::vector<int> toneids = {};
// Split the sentence by punctuation
LOG(INFO) << "Start to segment sentences by punctuation";
front_inst->SplitByPunc(ws_sentence, &sentence_part);
LOG(INFO) << "Segment sentences through punctuation successfully";
// Get the phoneme IDs of each split sentence
LOG(INFO)
<< "Start to get the phoneme and tone id sequence of each sentence";
for (int i = 0; i < sentence_part.size(); i++) {
LOG(INFO) << "Raw sentence is: "
<< ppspeech::wstring2utf8string(sentence_part[i]);
front_inst->SentenceNormalize(&sentence_part[i]);
s_sentence = ppspeech::wstring2utf8string(sentence_part[i]);
LOG(INFO) << "After normalization sentence is: " << s_sentence;
if (0 != front_inst->GetSentenceIds(s_sentence, &phoneids, &toneids)) {
LOG(ERROR) << "TTS inst get sentence phoneids and toneids failed";
return -1;
}
}
LOG(INFO) << "The phoneids of the sentence is: "
<< limonp::Join(phoneids.begin(), phoneids.end(), " ");
LOG(INFO) << "The toneids of the sentence is: "
<< limonp::Join(toneids.begin(), toneids.end(), " ");
LOG(INFO) << "Get the phoneme id sequence of each sentence successfully";
/////////////////////////// Backend: phonemes to audio ///////////////////////////
// The WAV sample rate must match the model output.
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
const uint32_t wavSampleRate = std::stoul(FLAGS_wav_sample_rate);
// Number of CPU threads
const int cpuThreadNum = std::stol(FLAGS_cpu_thread);
// CPU power mode
const PowerMode cpuPowerMode = PowerMode::LITE_POWER_HIGH;
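// LITE_POWER_HIGH prefers binding worker threads to the big (high-frequency) cores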
if (!predictor->Init(FLAGS_acoustic_model,
FLAGS_vocoder,
cpuPowerMode,
cpuThreadNum,
wavSampleRate)) {
LOG(ERROR) << "predictor init failed" << std::endl;
return -1;
}
if (!predictor.RunModel(sentencesToChoose[sentencesIndex])) {
std::cerr << "predictor run model failed" << std::endl;
std::vector<int64_t> phones(phoneids.size());
std::transform(phoneids.begin(), phoneids.end(), phones.begin(), [](int x) {
return static_cast<int64_t>(x);
});
if (!predictor->RunModel(phones)) {
LOG(ERROR) << "predictor run model failed" << std::endl;
return -1;
}
std::cout << "Inference time: " << predictor.GetInferenceTime() << " ms, "
<< "WAV size (without header): " << predictor.GetWavSize() << " bytes" << std::endl;
LOG(INFO) << "Inference time: " << predictor->GetInferenceTime() << " ms, "
<< "WAV size (without header): " << predictor->GetWavSize()
<< " bytes, "
<< "WAV duration: " << predictor->GetWavDuration() << " ms, "
<< "RTF: " << predictor->GetRTF() << std::endl;
if (!predictor.WriteWavToFile(outputWavPath)) {
std::cerr << "write wav file failed" << std::endl;
if (!predictor->WriteWavToFile(FLAGS_output_wav)) {
LOG(ERROR) << "write wav file failed" << std::endl;
return -1;
}
delete predictor;
return 0;
}

@ -0,0 +1 @@
TTSCppFrontend/third-party

@ -0,0 +1,2 @@
build/
dict/

@ -0,0 +1,63 @@
cmake_minimum_required(VERSION 3.10)
project(paddlespeech_tts_cpp)
########## Global Options ##########
option(WITH_FRONT_DEMO "Build front demo" ON)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(ABSL_PROPAGATE_CXX_STD ON)
########## Dependencies ##########
set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/third-party/build/lib/pkgconfig:${CMAKE_SOURCE_DIR}/third-party/build/lib64/pkgconfig")
find_package(PkgConfig REQUIRED)
# It is hard to load xxx-config.cmake in a custom location, so use pkgconfig instead.
pkg_check_modules(ABSL REQUIRED absl_strings IMPORTED_TARGET)
pkg_check_modules(GFLAGS REQUIRED gflags IMPORTED_TARGET)
pkg_check_modules(GLOG REQUIRED libglog IMPORTED_TARGET)
# load header-only libraries
include_directories(
${CMAKE_SOURCE_DIR}/third-party/build/src/cppjieba/include
${CMAKE_SOURCE_DIR}/third-party/build/src/limonp/include
)
find_package(Threads REQUIRED)
########## paddlespeech_tts_front ##########
include_directories(src)
file(GLOB FRONT_SOURCES
./src/base/*.cpp
./src/front/*.cpp
)
add_library(paddlespeech_tts_front STATIC ${FRONT_SOURCES})
target_link_libraries(
paddlespeech_tts_front
PUBLIC
PkgConfig::GFLAGS
PkgConfig::GLOG
PkgConfig::ABSL
Threads::Threads
)
########## tts_front_demo ##########
if (WITH_FRONT_DEMO)
file(GLOB FRONT_DEMO_SOURCES front_demo/*.cpp)
add_executable(tts_front_demo ${FRONT_DEMO_SOURCES})
target_include_directories(tts_front_demo PRIVATE ./front_demo)
target_link_libraries(tts_front_demo PRIVATE paddlespeech_tts_front)
endif (WITH_FRONT_DEMO)

@ -0,0 +1,56 @@
# PaddleSpeech TTS CPP Frontend
A TTS frontend that implements text-to-phoneme conversion.
Currently it only supports Chinese; any English word will crash the demo.
## Install Build Tools
```bash
# Ubuntu
sudo apt install build-essential cmake pkg-config
# CentOS
sudo yum groupinstall "Development Tools"
sudo yum install cmake
```
If your CMake version is too old, you can download a precompiled newer version from https://cmake.org/download/
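A quick check, plus a sketch of installing a prebuilt release into your home directory (the version below is only an example; pick the current one from the download page):
```bash
cmake --version
# Example: install a prebuilt CMake (adjust version and architecture to your platform)
wget https://github.com/Kitware/CMake/releases/download/v3.26.4/cmake-3.26.4-linux-x86_64.tar.gz
tar -xzf cmake-3.26.4-linux-x86_64.tar.gz -C "$HOME"
export PATH="$HOME/cmake-3.26.4-linux-x86_64/bin:$PATH"
```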
## Build
```bash
# Build with all CPU cores
./build.sh
# Build with 1 core
./build.sh -j1
```
Dependent libraries will be automatically downloaded to the `third-party/build` folder.
If the download speed is too slow, you can open [third-party/CMakeLists.txt](third-party/CMakeLists.txt) and modify `GIT_REPOSITORY` URLs.
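As an alternative to editing the CMakeLists, Git itself can rewrite clone URLs; a sketch assuming a hypothetical mirror host:
```bash
# Redirect all GitHub clones (including those made during the CMake build) to a mirror
git config --global url."https://github.example-mirror.com/".insteadOf "https://github.com/"
```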
## Download dictionary files
```bash
./download.sh
```
## Run
You can change `--phone2id_path` in `./front_demo/front.conf` to the `phone_id_map.txt` of your own acoustic model.
```bash
./run_front_demo.sh
./run_front_demo.sh --help
./run_front_demo.sh --sentence "这是语音合成服务的文本前端,用于将文本转换为音素序号数组。"
./run_front_demo.sh --front_conf ./front_demo/front.conf --sentence "你还需要一个语音合成后端才能将其转换为实际的声音。"
```
## Clean
```bash
./clean.sh
```
The folders `front_demo/dict`, `build` and `third-party/build` will be deleted.

@ -0,0 +1,20 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
cd ./third-party
mkdir -p build
cd build
cmake ..
if [ "$*" = "" ]; then
make -j$(nproc)
else
make "$@"
fi
echo "Done."

@ -0,0 +1,21 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
echo "************* Download & Build Dependencies *************"
./build-depends.sh "$@"
echo "************* Build Front Lib and Demo *************"
mkdir -p ./build
cd ./build
cmake ..
if [ "$*" = "" ]; then
make -j$(nproc)
else
make "$@"
fi
echo "Done."

@ -0,0 +1,10 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
rm -rf "./front_demo/dict"
rm -rf "./build"
rm -rf "./third-party/build"
echo "Done."

@ -0,0 +1,62 @@
#!/bin/bash
set -e
cd "$(dirname "$(realpath "$0")")"
download() {
file="$1"
url="$2"
md5="$3"
dir="$4"
cd "$dir"
if [ -f "$file" ] && [ "$(md5sum "$file" | awk '{ print $1 }')" = "$md5" ]; then
echo "File $file (MD5: $md5) has been downloaded."
else
echo "Downloading $file..."
wget -O "$file" "$url"
# verify MD5
fileMd5="$(md5sum "$file" | awk '{ print $1 }')"
if [ "$fileMd5" == "$md5" ]; then
echo "File $file (MD5: $md5) has been downloaded."
else
echo "MD5 mismatch, file may be corrupt"
echo "$file MD5: $fileMd5, it should be $md5"
fi
fi
echo "Extracting $file..."
echo '-----------------------'
tar -vxf "$file"
echo '======================='
}
########################################
DIST_DIR="$PWD/front_demo/dict"
mkdir -p "$DIST_DIR"
download 'fastspeech2_nosil_baker_ckpt_0.4.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/fastspeech2_nosil_baker_ckpt_0.4.tar.gz' \
'7bf1bab1737375fa123c413eb429c573' \
"$DIST_DIR"
download 'speedyspeech_nosil_baker_ckpt_0.5.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/speedyspeech_nosil_baker_ckpt_0.5.tar.gz' \
'0b7754b21f324789aef469c61f4d5b8f' \
"$DIST_DIR"
download 'jieba.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/jieba.tar.gz' \
'6d30f426bd8c0025110a483f051315ca' \
"$DIST_DIR"
download 'tranditional_to_simplified.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/tranditional_to_simplified.tar.gz' \
'258f5b59d5ebfe96d02007ca1d274a7f' \
"$DIST_DIR"
echo "Done."

@ -0,0 +1,21 @@
# jieba conf
--jieba_dict_path=./front_demo/dict/jieba/jieba.dict.utf8
--jieba_hmm_path=./front_demo/dict/jieba/hmm_model.utf8
--jieba_user_dict_path=./front_demo/dict/jieba/user.dict.utf8
--jieba_idf_path=./front_demo/dict/jieba/idf.utf8
--jieba_stop_word_path=./front_demo/dict/jieba/stop_words.utf8
# dict conf fastspeech2_0.4
--seperate_tone=false
--word2phone_path=./front_demo/dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
--phone2id_path=./front_demo/dict/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
--tone2id_path=./front_demo/dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
# dict conf speedyspeech_0.5
#--seperate_tone=true
#--word2phone_path=./front_demo/dict/speedyspeech_nosil_baker_ckpt_0.5/word2phone.dict
#--phone2id_path=./front_demo/dict/speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt
#--tone2id_path=./front_demo/dict/speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
# dict of tranditional_to_simplified
--trand2simpd_path=./front_demo/dict/tranditional_to_simplified/trand2simp.txt

@ -0,0 +1,79 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <gflags/gflags.h>
#include <glog/logging.h>
#include <map>
#include <string>
#include "front/front_interface.h"
DEFINE_string(sentence, "你好,欢迎使用语音合成服务", "Text to be synthesized");
DEFINE_string(front_conf, "./front_demo/front.conf", "Front conf file");
// DEFINE_string(seperate_tone, "true", "If true, get phoneids and tonesid");
int main(int argc, char** argv) {
gflags::ParseCommandLineFlags(&argc, &argv, true);
// Instantiate the text frontend engine
ppspeech::FrontEngineInterface* front_inst = nullptr;
front_inst = new ppspeech::FrontEngineInterface(FLAGS_front_conf);
if ((!front_inst) || (front_inst->init())) {
LOG(ERROR) << "Creater tts engine failed!";
if (front_inst != nullptr) {
delete front_inst;
}
front_inst = nullptr;
return -1;
}
std::wstring ws_sentence = ppspeech::utf8string2wstring(FLAGS_sentence);
// Convert traditional to simplified Chinese
std::wstring sentence_simp;
front_inst->Trand2Simp(ws_sentence, &sentence_simp);
ws_sentence = sentence_simp;
std::string s_sentence;
std::vector<std::wstring> sentence_part;
std::vector<int> phoneids = {};
std::vector<int> toneids = {};
// Split the sentence by punctuation
LOG(INFO) << "Start to segment sentences by punctuation";
front_inst->SplitByPunc(ws_sentence, &sentence_part);
LOG(INFO) << "Segment sentences through punctuation successfully";
// Get the phoneme IDs of each split sentence
LOG(INFO)
<< "Start to get the phoneme and tone id sequence of each sentence";
for (int i = 0; i < sentence_part.size(); i++) {
LOG(INFO) << "Raw sentence is: "
<< ppspeech::wstring2utf8string(sentence_part[i]);
front_inst->SentenceNormalize(&sentence_part[i]);
s_sentence = ppspeech::wstring2utf8string(sentence_part[i]);
LOG(INFO) << "After normalization sentence is: " << s_sentence;
if (0 != front_inst->GetSentenceIds(s_sentence, &phoneids, &toneids)) {
LOG(ERROR) << "TTS inst get sentence phoneids and toneids failed";
return -1;
}
}
LOG(INFO) << "The phoneids of the sentence is: "
<< limonp::Join(phoneids.begin(), phoneids.end(), " ");
LOG(INFO) << "The toneids of the sentence is: "
<< limonp::Join(toneids.begin(), toneids.end(), " ");
LOG(INFO) << "Get the phoneme id sequence of each sentence successfully";
return EXIT_SUCCESS;
}

@ -0,0 +1,111 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import configparser
from paddlespeech.t2s.frontend.zh_frontend import Frontend
def get_phone(frontend,
word,
merge_sentences=True,
print_info=False,
robot=False,
get_tone_ids=False):
phonemes = frontend.get_phonemes(word, merge_sentences, print_info, robot)
# Some optimizations
phones, tones = frontend._get_phone_tone(phonemes[0], get_tone_ids)
#print(type(phones), phones)
#print(type(tones), tones)
return phones, tones
def gen_word2phone_dict(frontend,
jieba_words_dict,
word2phone_dict,
get_tone=False):
with open(jieba_words_dict, "r") as f1, open(word2phone_dict, "w+") as f2:
for line in f1.readlines():
word = line.split(" ")[0]
phone, tone = get_phone(frontend, word, get_tone_ids=get_tone)
phone_str = ""
if tone:
assert (len(phone) == len(tone))
for i in range(len(tone)):
phone_tone = phone[i] + tone[i]
phone_str += (" " + phone_tone)
phone_str = phone_str.strip("sp0").strip(" ")
else:
for x in phone:
phone_str += (" " + x)
phone_str = phone_str.strip("sp").strip(" ")
print(phone_str)
f2.write(word + " " + phone_str + "\n")
print("Generate word2phone dict successfully.")
def main():
parser = argparse.ArgumentParser(description="Generate dictionary")
parser.add_argument(
"--config", type=str, default="./config.ini", help="config file.")
parser.add_argument(
"--am_type",
type=str,
default="fastspeech2",
help="fastspeech2 or speedyspeech")
args = parser.parse_args()
# Read config
cf = configparser.ConfigParser()
cf.read(args.config)
jieba_words_dict_file = cf.get("jieba",
"jieba_words_dict") # get words dict
am_type = args.am_type
if (am_type == "fastspeech2"):
phone2id_dict_file = cf.get(am_type, "phone2id_dict")
word2phone_dict_file = cf.get(am_type, "word2phone_dict")
frontend = Frontend(phone_vocab_path=phone2id_dict_file)
print("frontend done!")
gen_word2phone_dict(
frontend,
jieba_words_dict_file,
word2phone_dict_file,
get_tone=False)
elif (am_type == "speedyspeech"):
phone2id_dict_file = cf.get(am_type, "phone2id_dict")
tone2id_dict_file = cf.get(am_type, "tone2id_dict")
word2phone_dict_file = cf.get(am_type, "word2phone_dict")
frontend = Frontend(
phone_vocab_path=phone2id_dict_file,
tone_vocab_path=tone2id_dict_file)
print("frontend done!")
gen_word2phone_dict(
frontend,
jieba_words_dict_file,
word2phone_dict_file,
get_tone=True)
else:
print("Please set correct am type, fastspeech2 or speedyspeech.")
if __name__ == "__main__":
main()

@ -0,0 +1,35 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
PHONESFILE = "./dict/phones.txt"
PHONES_ID_FILE = "./dict/phonesid.dict"
TONESFILE = "./dict/tones.txt"
TONES_ID_FILE = "./dict/tonesid.dict"
def GenIdFile(file, idfile):
id = 2
with open(file, 'r') as f1, open(idfile, "w+") as f2:
f2.write("<pad> 0\n")
f2.write("<unk> 1\n")
for line in f1.readlines():
phone = line.strip()
print(phone + " " + str(id) + "\n")
f2.write(phone + " " + str(id) + "\n")
id += 1
if __name__ == "__main__":
GenIdFile(PHONESFILE, PHONES_ID_FILE)
GenIdFile(TONESFILE, TONES_ID_FILE)

@ -0,0 +1,55 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from pypinyin import lazy_pinyin
from pypinyin import Style
worddict = "./dict/jieba_part.dict.utf8"
newdict = "./dict/word_phones.dict"
def GenPhones(initials, finals, seperate=True):
phones = []
for c, v in zip(initials, finals):
if re.match(r'i\d', v):
if c in ['z', 'c', 's']:
v = re.sub('i', 'ii', v)
elif c in ['zh', 'ch', 'sh', 'r']:
v = re.sub('i', 'iii', v)
if c:
if seperate is True:
phones.append(c + '0')
elif seperate is False:
phones.append(c)
else:
print("Not sure whether phone and tone need to be separated")
if v:
phones.append(v)
return phones
with open(worddict, "r") as f1, open(newdict, "w+") as f2:
for line in f1.readlines():
word = line.split(" ")[0]
initials = lazy_pinyin(
word, neutral_tone_with_five=True, style=Style.INITIALS)
finals = lazy_pinyin(
word, neutral_tone_with_five=True, style=Style.FINALS_TONE3)
phones = GenPhones(initials, finals, True)
temp = " ".join(phones)
f2.write(word + " " + temp + "\n")

@ -0,0 +1,7 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
./build/tts_front_demo "$@"

@ -0,0 +1,28 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "base/type_conv.h"
namespace ppspeech {
// wstring to string
std::string wstring2utf8string(const std::wstring& str) {
static std::wstring_convert<std::codecvt_utf8<wchar_t>> strCnv;
return strCnv.to_bytes(str);
}
// string to wstring
std::wstring utf8string2wstring(const std::string& str) {
static std::wstring_convert<std::codecvt_utf8<wchar_t>> strCnv;
return strCnv.from_bytes(str);
}
} // namespace ppspeech

@ -0,0 +1,31 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef BASE_TYPE_CONVC_H
#define BASE_TYPE_CONVC_H
#include <codecvt>
#include <locale>
#include <string>
namespace ppspeech {
// wstring to string
std::string wstring2utf8string(const std::wstring& str);
// string to wstring
std::wstring utf8string2wstring(const std::string& str);
}  // namespace ppspeech
#endif // BASE_TYPE_CONVC_H

File diff suppressed because it is too large

@ -0,0 +1,198 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef PADDLE_TTS_SERVING_FRONT_FRONT_INTERFACE_H
#define PADDLE_TTS_SERVING_FRONT_FRONT_INTERFACE_H
#include <glog/logging.h>
#include <fstream>
#include <map>
#include <memory>
#include <string>
//#include "utils/dir_utils.h"
#include <cppjieba/Jieba.hpp>
#include "absl/strings/str_split.h"
#include "front/text_normalize.h"
namespace ppspeech {
class FrontEngineInterface : public TextNormalizer {
public:
explicit FrontEngineInterface(std::string conf) : _conf_file(conf) {
TextNormalizer();
_jieba = nullptr;
_initialed = false;
init();
}
int init();
~FrontEngineInterface() {}
// Read the configuration file
int ReadConfFile();
// Convert traditional to simplified Chinese
int Trand2Simp(const std::wstring &sentence, std::wstring *sentence_simp);
// Build a dictionary map from a file
int GenDict(const std::string &file,
std::map<std::string, std::string> *map);
// Reduce word + POS segmentation results to words only
int GetSegResult(std::vector<std::pair<std::string, std::string>> *seg,
std::vector<std::string> *seg_words);
// Generate the phoneme and tone IDs of a sentence. If phonemes and tones are not
// separated, toneids is empty (fastspeech2); otherwise it is non-empty (speedyspeech)
int GetSentenceIds(const std::string &sentence,
std::vector<int> *phoneids,
std::vector<int> *toneids);
// Get the phoneme and tone IDs of each word from the segmentation result and
// modify pronunciations where appropriate (ModifyTone). If phonemes and tones are
// not separated, toneids is empty (fastspeech2); otherwise it is non-empty (speedyspeech)
int GetWordsIds(
const std::vector<std::pair<std::string, std::string>> &cut_result,
std::vector<int> *phoneids,
std::vector<int> *toneids);
// Segment with jieba into word + POS pairs, then post-process the result
// as appropriate (MergeforModify)
int Cut(const std::string &sentence,
std::vector<std::pair<std::string, std::string>> *cut_result);
// Map a word to its phonemes via dictionary lookup
int GetPhone(const std::string &word, std::string *phone);
// Map phonemes to phoneme IDs
int Phone2Phoneid(const std::string &phone,
std::vector<int> *phoneid,
std::vector<int> *toneids);
// Judge from the finals whether every character in the word is third tone; true means all are
bool AllToneThree(const std::vector<std::string> &finals);
// Whether the word is a reduplication
bool IsReduplication(const std::string &word);
// Get the initials and finals of each character in the word
int GetInitialsFinals(const std::string &word,
std::vector<std::string> *word_initials,
std::vector<std::string> *word_finals);
// Get the finals of each character in the word
int GetFinals(const std::string &word,
std::vector<std::string> *word_finals);
// Convert the whole word into a vector, one element per character
int Word2WordVec(const std::string &word,
std::vector<std::wstring> *wordvec);
// Re-segment the whole word with a full cut; the resulting sub-words will all be in the dictionary
int SplitWord(const std::string &word,
std::vector<std::string> *fullcut_word);
// Post-process segmentation: tidy up results containing "不"
std::vector<std::pair<std::string, std::string>> MergeBu(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process segmentation: tidy up results containing "一"
std::vector<std::pair<std::string, std::string>> Mergeyi(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process segmentation: merge two adjacent identical characters
std::vector<std::pair<std::string, std::string>> MergeReduplication(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Merge two consecutive words whose characters are all third tone
std::vector<std::pair<std::string, std::string>> MergeThreeTones(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Merge two words where the last tone of the first and the first tone of the second are both third tone
std::vector<std::pair<std::string, std::string>> MergeThreeTones2(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process segmentation: tidy up results containing "儿"
std::vector<std::pair<std::string, std::string>> MergeEr(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process and modify the segmentation result
int MergeforModify(
std::vector<std::pair<std::string, std::string>> *seg_result,
std::vector<std::pair<std::string, std::string>> *merge_seg_result);
// Modify the tones of words containing "不"
int BuSandi(const std::string &word, std::vector<std::string> *finals);
// Modify the tones of words containing "一"
int YiSandhi(const std::string &word, std::vector<std::string> *finals);
// Modify the tones of special words (measure words, particles, etc.)
int NeuralSandhi(const std::string &word,
const std::string &pos,
std::vector<std::string> *finals);
// Modify the tones of words containing the third tone
int ThreeSandhi(const std::string &word, std::vector<std::string> *finals);
// Process and modify the tones of a word
int ModifyTone(const std::string &word,
const std::string &pos,
std::vector<std::string> *finals);
// Handle erhua (rhotacization)
std::vector<std::vector<std::string>> MergeErhua(
const std::vector<std::string> &initials,
const std::vector<std::string> &finals,
const std::string &word,
const std::string &pos);
private:
bool _initialed;
cppjieba::Jieba *_jieba;
std::vector<std::string> _punc;
std::vector<std::string> _punc_omit;
std::string _conf_file;
std::map<std::string, std::string> conf_map;
std::map<std::string, std::string> word_phone_map;
std::map<std::string, std::string> phone_id_map;
std::map<std::string, std::string> tone_id_map;
std::map<std::string, std::string> trand_simp_map;
std::string _jieba_dict_path;
std::string _jieba_hmm_path;
std::string _jieba_user_dict_path;
std::string _jieba_idf_path;
std::string _jieba_stop_word_path;
std::string _seperate_tone;
std::string _word2phone_path;
std::string _phone2id_path;
std::string _tone2id_path;
std::string _trand2simp_path;
std::vector<std::string> must_erhua;
std::vector<std::string> not_erhua;
std::vector<std::string> must_not_neural_tone_words;
std::vector<std::string> must_neural_tone_words;
};
} // namespace ppspeech
#endif
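// ---------------------------------------------------------------------------
// Illustration (not part of the header): the Merge* passes above all share one
// shape: scan the (word, POS) pairs from jieba and fuse adjacent entries that
// satisfy a predicate. A minimal, hypothetical sketch of that pattern:
//
//     using SegResult = std::vector<std::pair<std::string, std::string>>;
//
//     template <typename Pred>
//     SegResult MergeAdjacent(const SegResult &seg, Pred should_merge) {
//         SegResult out;
//         for (const auto &item : seg) {
//             if (!out.empty() && should_merge(out.back(), item)) {
//                 out.back().first += item.first;  // fuse the surface strings
//             } else {
//                 out.push_back(item);
//             }
//         }
//         return out;
//     }
//
// MergeReduplication, for example, would fire the predicate when the previous
// and current words are identical (e.g. 看 + 看).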

@ -0,0 +1,542 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "front/text_normalize.h"
namespace ppspeech {
// Initialize digits_map and units_map
int TextNormalizer::InitMap() {
digits_map["0"] = "零";
digits_map["1"] = "一";
digits_map["2"] = "二";
digits_map["3"] = "三";
digits_map["4"] = "四";
digits_map["5"] = "五";
digits_map["6"] = "六";
digits_map["7"] = "七";
digits_map["8"] = "八";
digits_map["9"] = "九";
units_map[1] = "十";
units_map[2] = "百";
units_map[3] = "千";
units_map[4] = "万";
units_map[8] = "亿";
return 0;
}
// Replace a span of the sentence with a new string
int TextNormalizer::Replace(std::wstring *sentence,
const int &pos,
const int &len,
const std::wstring &repstr) {
// erase the original span
sentence->erase(pos, len);
// insert the replacement
sentence->insert(pos, repstr);
return 0;
}
// Split the sentence at punctuation marks
int TextNormalizer::SplitByPunc(const std::wstring &sentence,
std::vector<std::wstring> *sentence_part) {
std::wstring temp = sentence;
std::wregex reg(L"[:,;。?!,;?!]");
std::wsmatch match;
while (std::regex_search(temp, match, reg)) {
sentence_part->push_back(
temp.substr(0, match.position(0) + match.length(0)));
Replace(&temp, 0, match.position(0) + match.length(0), L"");
}
// trailing text with no closing punctuation
if (temp != L"") {
sentence_part->push_back(temp);
}
return 0;
}
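// Example (illustrative): L"今天很热,我们去游泳吧!" is split into
// L"今天很热," and L"我们去游泳吧!"; each piece keeps its trailing punctuation.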
// Number to text, e.g. 10200 -> 一万零二百
std::string TextNormalizer::CreateTextValue(const std::string &num_str,
bool use_zero) {
std::string num_lstrip =
std::string(absl::StripPrefix(num_str, "0")).data();
int len = num_lstrip.length();
if (len == 0) {
return "";
} else if (len == 1) {
if (use_zero && (len < num_str.length())) {
return digits_map["0"] + digits_map[num_lstrip];
} else {
return digits_map[num_lstrip];
}
} else {
int largest_unit = 0;  // largest place-value unit
std::string first_part;
std::string second_part;
if (len > 1 && len <= 2) {
largest_unit = 1;
} else if (len > 2 && len <= 3) {
largest_unit = 2;
} else if (len > 3 && len <= 4) {
largest_unit = 3;
} else if (len > 4 && len <= 8) {
largest_unit = 4;
} else if (len > 8) {
largest_unit = 8;
}
first_part = num_str.substr(0, num_str.length() - largest_unit);
second_part = num_str.substr(num_str.length() - largest_unit);
return CreateTextValue(first_part, use_zero) + units_map[largest_unit] +
CreateTextValue(second_part, use_zero);
}
}
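// Worked example of the recursion above (illustrative comment only):
//   CreateTextValue("10200") -> largest_unit = 4 (万),
//   first_part = "1", second_part = "0200"
//   => CreateTextValue("1") + 万 + CreateTextValue("0200")
//   => 一 + 万 + 零二百   (the leading zero in "0200" prepends 零)
//   => 一万零二百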
// Read the digits one by one; usable directly for years and phone numbers
std::string TextNormalizer::SingleDigit2Text(const std::string &num_str,
bool alt_one) {
std::string text = "";
if (alt_one) {
digits_map["1"] = "幺";
} else {
digits_map["1"] = "一";
}
for (size_t i = 0; i < num_str.size(); i++) {
std::string num_int(1, num_str[i]);
if (digits_map.find(num_int) == digits_map.end()) {
LOG(ERROR) << "digits_map doesn't have key: " << num_int;
}
text += digits_map[num_int];
}
return text;
}
std::string TextNormalizer::SingleDigit2Text(const std::wstring &num,
bool alt_one) {
std::string num_str = wstring2utf8string(num);
return SingleDigit2Text(num_str, alt_one);
}
// Read the number as a whole; usable for months, dates, and the integer part of a value
std::string TextNormalizer::MultiDigit2Text(const std::string &num_str,
bool alt_one,
bool use_zero) {
LOG(INFO) << "aaaaaaaaaaaaaaaa: " << alt_one << use_zero;
if (alt_one) {
digits_map["1"] = "";
} else {
digits_map["1"] = "";
}
std::wstring result =
utf8string2wstring(CreateTextValue(num_str, use_zero));
std::wstring result_0(1, result[0]);
std::wstring result_1(1, result[1]);
// drop the leading 一 before 十: 一十八 --> 十八
if ((result_0 == utf8string2wstring(digits_map["1"])) &&
(result_1 == utf8string2wstring(units_map[1]))) {
return wstring2utf8string(result.substr(1, result.length()));
} else {
return wstring2utf8string(result);
}
}
std::string TextNormalizer::MultiDigit2Text(const std::wstring &num,
bool alt_one,
bool use_zero) {
std::string num_str = wstring2utf8string(num);
return MultiDigit2Text(num_str, alt_one, use_zero);
}
// Number to text, covering both integers and decimals
std::string TextNormalizer::Digits2Text(const std::string &num_str) {
std::string text;
std::vector<std::string> integer_decimal;
integer_decimal = absl::StrSplit(num_str, ".");
if (integer_decimal.size() == 1) {  // integer
text = MultiDigit2Text(integer_decimal[0]);
} else if (integer_decimal.size() == 2) {  // decimal
if (integer_decimal[0] == "") {  // no integer part, e.g. .22
text = "点" +
SingleDigit2Text(
std::string(absl::StripSuffix(integer_decimal[1], "0"))
.data());
} else {  // ordinary decimal, e.g. 12.34
text = MultiDigit2Text(integer_decimal[0]) + "点" +
SingleDigit2Text(
std::string(absl::StripSuffix(integer_decimal[1], "0"))
.data());
}
} else {
return "The value does not conform to the numeric format";
}
return text;
}
std::string TextNormalizer::Digits2Text(const std::wstring &num) {
std::string num_str = wstring2utf8string(num);
return Digits2Text(num_str);
}
// Date, e.g. 2021年8月18日 --> 二零二一年八月十八日
int TextNormalizer::ReData(std::wstring *sentence) {
std::wregex reg(
L"(\\d{4}|\\d{2})年((0?[1-9]|1[0-2])月)?(((0?[1-9])|((1|2)[0-9])|30|31)"
L"([日号]))?");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
rep += SingleDigit2Text(match[1]) + "年";
if (match[3] != L"") {
rep += MultiDigit2Text(match[3], false, false) + "月";
}
if (match[5] != L"") {
rep += MultiDigit2Text(match[5], false, false) +
wstring2utf8string(match[9]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// XX-XX-XX or XX/XX/XX, e.g. 2021/08/18 --> 二零二一年八月十八日
int TextNormalizer::ReData2(std::wstring *sentence) {
std::wregex reg(
L"(\\d{4})([- /.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
rep += (SingleDigit2Text(match[1]) + "年");
rep += (MultiDigit2Text(match[3], false, false) + "月");
rep += (MultiDigit2Text(match[4], false, false) + "日");
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// XX:XX:XX, e.g. 09:09:02 --> 九点零九分零二秒
int TextNormalizer::ReTime(std::wstring *sentence) {
std::wregex reg(L"([0-1]?[0-9]|2[0-3]):([0-5][0-9])(:([0-5][0-9]))?");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
rep += (MultiDigit2Text(match[1], false, false) + "点");
if (absl::StartsWith(wstring2utf8string(match[2]), "0")) {
rep += "零";
}
rep += (MultiDigit2Text(match[2]) + "分");
if (match[4] != L"") {  // the seconds group is optional
if (absl::StartsWith(wstring2utf8string(match[4]), "0")) {
rep += "零";
}
rep += (MultiDigit2Text(match[4]) + "秒");
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Temperature, e.g. -24.3℃ --> 零下二十四点三度
int TextNormalizer::ReTemperature(std::wstring *sentence) {
std::wregex reg(L"(-?)(\\d+(\\.\\d+)?)(°C|℃|度|摄氏度)");
std::wsmatch match;
std::string rep;
std::string sign;
std::string unit;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "零下" : sign = "";
match[4] == L"摄氏度" ? unit = "摄氏度" : unit = "度";
rep = sign + Digits2Text(match[2]) + unit;
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Fraction, e.g. 1/3 --> 三分之一
int TextNormalizer::ReFrac(std::wstring *sentence) {
std::wregex reg(L"(-?)(\\d+)/(\\d+)");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "负" : sign = "";
rep = sign + MultiDigit2Text(match[3]) + "分之" +
MultiDigit2Text(match[2]);
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Percentage, e.g. 45.5% --> 百分之四十五点五
int TextNormalizer::RePercentage(std::wstring *sentence) {
std::wregex reg(L"(-?)(\\d+(\\.\\d+)?)%");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "负" : sign = "";
rep = sign + "百分之" + Digits2Text(match[2]);
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Mobile phone number, e.g. +86 18883862235 --> 八六幺八八八三八六二二三五
int TextNormalizer::ReMobilePhone(std::wstring *sentence) {
std::wregex reg(
L"(\\d)?((\\+?86 ?)?1([38]\\d|5[0-35-9]|7[678]|9[89])\\d{8})(\\d)?");
std::wsmatch match;
std::string rep;
std::vector<std::string> country_phonenum;
while (std::regex_search(*sentence, match, reg)) {
country_phonenum = absl::StrSplit(wstring2utf8string(match[0]), "+");
rep = "";
for (int i = 0; i < country_phonenum.size(); i++) {
LOG(INFO) << country_phonenum[i];
rep += SingleDigit2Text(country_phonenum[i], true);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Landline number, e.g. 010-51093154 --> 零幺零五幺零九三幺五四
int TextNormalizer::RePhone(std::wstring *sentence) {
std::wregex reg(
L"(\\d)?((0(10|2[1-3]|[3-9]\\d{2})-?)?[1-9]\\d{6,7})(\\d)?");
std::wsmatch match;
std::vector<std::string> zone_phonenum;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
zone_phonenum = absl::StrSplit(wstring2utf8string(match[0]), "-");
for (int i = 0; i < zone_phonenum.size(); i++) {
rep += SingleDigit2Text(zone_phonenum[i], true);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Range, e.g. 60~90 --> 六十到九十
int TextNormalizer::ReRange(std::wstring *sentence) {
std::wregex reg(
L"((-?)((\\d+)(\\.\\d+)?)|(\\.(\\d+)))[-~]((-?)((\\d+)(\\.\\d+)?)|(\\.("
L"\\d+)))");
std::wsmatch match;
std::string rep;
std::string sign1;
std::string sign2;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
match[2] == L"-" ? sign1 = "负" : sign1 = "";
if (match[6] != L"") {
rep += sign1 + Digits2Text(match[6]) + "到";
} else {
rep += sign1 + Digits2Text(match[3]) + "到";
}
match[9] == L"-" ? sign2 = "负" : sign2 = "";
if (match[13] != L"") {
rep += sign2 + Digits2Text(match[13]);
} else {
rep += sign2 + Digits2Text(match[10]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Integer with a negative sign, e.g. -10 --> 负十
int TextNormalizer::ReInterger(std::wstring *sentence) {
std::wregex reg(L"(-)(\\d+)");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "负" + MultiDigit2Text(match[2]);
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Pure decimal numbers
int TextNormalizer::ReDecimalNum(std::wstring *sentence) {
std::wregex reg(L"(-?)((\\d+)(\\.\\d+))|(\\.(\\d+))");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "负" : sign = "";
if (match[5] != L"") {
rep = sign + Digits2Text(match[5]);
} else {
rep = sign + Digits2Text(match[2]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Positive integer + measure word
int TextNormalizer::RePositiveQuantifiers(std::wstring *sentence) {
std::wstring common_quantifiers =
L"(朵|匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|"
L"担|颗|壳|窠|曲|墙|群|腔|砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|"
L"溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|针|线|管|名|位|身|堂|课|"
L"本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|"
L"毫|厘|(公)分|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|米|撮|勺|"
L"合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|"
L"卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|纪|岁|世|更|"
L"夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块|"
L"元|(亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|美|)元|(亿|千万|"
L"百万|万|千|百|)块|角|毛|分)";
std::wregex reg(L"(\\d+)([多余几])?" + common_quantifiers);
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = MultiDigit2Text(match[1]);
Replace(sentence,
match.position(1),
match.length(1),
utf8string2wstring(rep));
}
return 0;
}
// ID-style digit strings, e.g. 89757 --> 八九七五七
int TextNormalizer::ReDefalutNum(std::wstring *sentence) {
std::wregex reg(L"\\d{3}\\d*");
std::wsmatch match;
while (std::regex_search(*sentence, match, reg)) {
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(SingleDigit2Text(match[0])));
}
return 0;
}
int TextNormalizer::ReNumber(std::wstring *sentence) {
std::wregex reg(L"(-?)((\\d+)(\\.\\d+)?)|(\\.(\\d+))");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "" : sign = "";
if (match[5] != L"") {
rep = sign + Digits2Text(match[5]);
} else {
rep = sign + Digits2Text(match[2]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Apply all the regex rules above, in order
int TextNormalizer::SentenceNormalize(std::wstring *sentence) {
ReData(sentence);
ReData2(sentence);
ReTime(sentence);
ReTemperature(sentence);
ReFrac(sentence);
RePercentage(sentence);
ReMobilePhone(sentence);
RePhone(sentence);
ReRange(sentence);
ReInterger(sentence);
ReDecimalNum(sentence);
RePositiveQuantifiers(sentence);
ReDefalutNum(sentence);
ReNumber(sentence);
return 0;
}
} // namespace ppspeech
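// Usage sketch (illustrative, not part of this file), assuming a UTF-8 locale
// and the declarations from front/text_normalize.h:
//
//     ppspeech::TextNormalizer tn;
//     std::wstring s = utf8string2wstring("会议时间2021/08/18 09:30,气温-24.3℃");
//     tn.SentenceNormalize(&s);   // dates, times, numbers become Chinese words
//     std::vector<std::wstring> parts;
//     tn.SplitByPunc(s, &parts);  // then split on punctuation for the frontend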

@ -0,0 +1,77 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef PADDLE_TTS_SERVING_FRONT_TEXT_NORMALIZE_H
#define PADDLE_TTS_SERVING_FRONT_TEXT_NORMALIZE_H
#include <glog/logging.h>
#include <codecvt>
#include <map>
#include <regex>
#include <string>
#include "absl/strings/str_split.h"
#include "absl/strings/strip.h"
#include "base/type_conv.h"
namespace ppspeech {
class TextNormalizer {
public:
TextNormalizer() { InitMap(); }
~TextNormalizer() {}
int InitMap();
int Replace(std::wstring *sentence,
const int &pos,
const int &len,
const std::wstring &repstr);
int SplitByPunc(const std::wstring &sentence,
std::vector<std::wstring> *sentence_part);
std::string CreateTextValue(const std::string &num, bool use_zero = true);
std::string SingleDigit2Text(const std::string &num_str,
bool alt_one = false);
std::string SingleDigit2Text(const std::wstring &num, bool alt_one = false);
std::string MultiDigit2Text(const std::string &num_str,
bool alt_one = false,
bool use_zero = true);
std::string MultiDigit2Text(const std::wstring &num,
bool alt_one = false,
bool use_zero = true);
std::string Digits2Text(const std::string &num_str);
std::string Digits2Text(const std::wstring &num);
int ReData(std::wstring *sentence);
int ReData2(std::wstring *sentence);
int ReTime(std::wstring *sentence);
int ReTemperature(std::wstring *sentence);
int ReFrac(std::wstring *sentence);
int RePercentage(std::wstring *sentence);
int ReMobilePhone(std::wstring *sentence);
int RePhone(std::wstring *sentence);
int ReRange(std::wstring *sentence);
int ReInterger(std::wstring *sentence);
int ReDecimalNum(std::wstring *sentence);
int RePositiveQuantifiers(std::wstring *sentence);
int ReDefalutNum(std::wstring *sentence);
int ReNumber(std::wstring *sentence);
int SentenceNormalize(std::wstring *sentence);
private:
std::map<std::string, std::string> digits_map;
std::map<int, std::string> units_map;
};
} // namespace ppspeech
#endif

@ -0,0 +1,64 @@
cmake_minimum_required(VERSION 3.10)
project(tts_third_party_libs)
include(ExternalProject)
# gflags
ExternalProject_Add(gflags
GIT_REPOSITORY https://github.com/gflags/gflags.git
GIT_TAG v2.2.2
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
INSTALL_DIR ${CMAKE_CURRENT_BINARY_DIR}
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
-DCMAKE_POSITION_INDEPENDENT_CODE=ON
-DBUILD_STATIC_LIBS=OFF
-DBUILD_SHARED_LIBS=ON
)
# glog
ExternalProject_Add(
glog
GIT_REPOSITORY https://github.com/google/glog.git
GIT_TAG v0.6.0
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
INSTALL_DIR ${CMAKE_CURRENT_BINARY_DIR}
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
-DCMAKE_POSITION_INDEPENDENT_CODE=ON
DEPENDS gflags
)
# abseil
ExternalProject_Add(
abseil
GIT_REPOSITORY https://github.com/abseil/abseil-cpp.git
GIT_TAG 20230125.1
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
INSTALL_DIR ${CMAKE_CURRENT_BINARY_DIR}
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
-DCMAKE_POSITION_INDEPENDENT_CODE=ON
-DABSL_PROPAGATE_CXX_STD=ON
)
# cppjieba (header-only)
ExternalProject_Add(
cppjieba
GIT_REPOSITORY https://github.com/yanyiwu/cppjieba.git
GIT_TAG v5.0.3
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)
# limonp (header-only)
ExternalProject_Add(
limonp
GIT_REPOSITORY https://github.com/yanyiwu/limonp.git
GIT_TAG v0.6.6
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)

@ -9,7 +9,7 @@ This demo is an implementation of starting the streaming speech service and acce
The streaming ASR server only supports the `websocket` protocol; the `http` protocol is not supported.
服务接口定义请参考:
For service interface definitions, please refer to:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## Usage
@ -23,7 +23,7 @@ You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to
### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` `conf/ws_conformer_wenetspeech_application.yaml`.
The configuration file can be found in `conf/ws_application.yaml` or `conf/ws_conformer_wenetspeech_application.yaml`.
At present, the speech tasks integrated by the model include: DeepSpeech2 and conformer.
@ -87,7 +87,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
server_executor = ServerExecutor()
server_executor(
config_file="./conf/ws_conformer_wenetspeech_application.yaml",
config_file="./conf/ws_conformer_wenetspeech_application_faster.yaml",
log_file="./log/paddlespeech.log")
```

@ -90,7 +90,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
server_executor = ServerExecutor()
server_executor(
config_file="./conf/ws_conformer_wenetspeech_application",
config_file="./conf/ws_conformer_wenetspeech_application_faster.yaml",
log_file="./log/paddlespeech.log")
```

Binary file not shown.

@ -38,8 +38,8 @@ sphinx-markdown-tables
sphinx_rtd_theme
textgrid
timer
ToJyutping
typeguard
ToJyutping==0.2.1
typeguard==2.13.3
webrtcvad
websockets
yacs~=0.1.8

@ -25,7 +25,7 @@ Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions
[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | - | 1.18 GB |Pre-trained Wav2vec2.0 Model | - | - | - |
[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 718 MB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
[Wav2vec2-large-wenetspeech-self Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2-large-wenetspeech-self_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | - | 714 MB |Pre-trained Wav2vec2.0 Model | - | - | - |
[Wav2vec2ASR-large-aishell1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | aishell1 (train set) | 1.17 GB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0453 | - | - |
[Wav2vec2ASR-large-aishell1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | aishell1 (train set) | 1.18 GB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0510 | - | - |
### Whisper Model
Demo Link | Training Data | Size | Descriptions | CER | Model

@ -0,0 +1,183 @@
The author is not a music professional; corrections to any errors in this document are welcome.
# 1. Basic Concepts
## 1.1 Numbered Musical Notation and Note Names
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/seven.png" width="300"/>
</p>
In the figure above, the black keys from left to right are: C#/Db, D#/Eb, F#/Gb, G#/Ab, A#/Bb.
The 88 piano keys, shown below, are grouped into the contra octave, great octave, small octave, and the one-line through four-line octaves, corresponding to the note-name suffixes 1 2 3 4 5 6 7. For example, the one-line octave (in C major) contains the keys C4, C#4/Db4, D4, D#4/Eb4, E4, F4, F#4/Gb4, G4, G#4/Ab4, A4, A#4/Bb4, B4.
A piano octave is the eight notes 1 2 3 4 5 6 7 (1), the last being 1 an octave higher. **Following the interval pattern whole-whole-half-whole-whole-whole-half** yields the notes 1 2 3 4 5 6 7 (high) 1.
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/piano_88.png" />
</p>
## 1.2 The Twelve Major Keys
"#" denotes a sharp (a key raised by a semitone):
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/up.png" />
</p>
"b" denotes a flat (a key lowered by a semitone):
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/down.png" />
</p>
The name of the major key tells you which piano key the note Do (1 in numbered notation) starts on; in D major, for example, the key D plays Do.
The table below maps numbered notation to note names in each of the twelve major keys.
<p align="left">
<img src="../../../docs/images/note_map.png" />
</p>
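As a concrete reading of that table, here is a small sketch (illustrative only; it hard-codes E-flat major starting in the small octave, the key used by the example score in Section 2.1):
```cpp
#include <array>
#include <iostream>
#include <string>

// Scale degrees 1-7 of E-flat major starting in the small octave (octave 3).
int main() {
    const std::array<std::string, 7> eb_major = {
        "D#3/Eb3", "F3", "G3", "G#3/Ab3", "A#3/Bb3", "C4", "D4"};
    for (int degree = 1; degree <= 7; ++degree) {
        std::cout << degree << " -> " << eb_major[degree - 1] << "\n";
    }
    return 0;
}
```
A trailing dot in the numbered notation (e.g. `1.`) moves the note up one octave, so `1.` in this key is D#4/Eb4.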
## 1.3 Tempo
Tempo gives the speed of the beat/pulse, measured in beats per minute (BPM).
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/note_beat.png" width="450"/>
</p>
whole note --> 4 beats</br>
half note --> 2 beats</br>
quarter note --> 1 beat</br>
eighth note --> 1/2 beat</br>
sixteenth note --> 1/4 beat</br>
# 2. Putting It into Practice
## 2.1 Extracting the music score from sheet music
A music score consists of: note, note_dur, is_slur.
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/pu.png" width="600"/>
</p>
The key signature *bE* in the upper-left corner tells us the piece is in **E-flat major**; with the twelve-major-keys table from Section 1.2, each numbered-notation digit can be mapped to its note.
From the *quarter note* tempo information in the upper-left corner, the speed is **95 beats per minute**, so one beat lasts **60/95 = 0.631578 s**.
The *4/4* time signature means a quarter note gets one beat (the denominator 4) and each measure holds 4 beats (the numerator 4).
The music score extracted from this sheet is as follows:
|text |phone |numbered notation (aux.; a trailing dot marks the next octave up) |note (counting from the small octave) |beats (aux.) |note_dur |is_slur|
|:-------------:| :------------:| :-----: | :-----: | :-----: |:-----:| :-----: |
|小 |x |5 |A#3/Bb3 |half |0.315789 |0 |
| |iao |5 |A#3/Bb3 |half |0.315789 |0 |
|酒 |j |1. |D#4/Eb4 |half |0.315789 |0 |
| |iu |1. |D#4/Eb4 |half |0.315789 |0 |
|窝 |w |2. |F4 |half |0.315789 |0 |
| |o |2. |F4 |half |0.315789 |0 |
|长 |ch |3. |G4 |half |0.315789 |0 |
| |ang |3. |G4 |half |0.315789 |0 |
| |ang |1. |D#4/Eb4 |half |0.315789 |1 |
|睫 |j |1. |D#4/Eb4 |half |0.315789 |0 |
| |ie |1. |D#4/Eb4 |half |0.315789 |0 |
| |ie |5 |A#3/Bb3 |half |0.315789 |1 |
|毛 |m |5 |A#3/Bb3 |one |0.631578 |0 |
| |ao |5 |A#3/Bb3 |one |0.631578 |0 |
|是 |sh |5 |A#3/Bb3 |half |0.315789 |0 |
| |i |5 |A#3/Bb3 |half |0.315789 |0 |
|你 |n |3. |G4 |half |0.315789 |0 |
| |i |3. |G4 |half |0.315789 |0 |
|最 |z |2. |F4 |half |0.315789 |0 |
| |ui |2. |F4 |half |0.315789 |0 |
|美 |m |3. |G4 |half |0.315789 |0 |
| |ei |3. |G4 |half |0.315789 |0 |
|的 |d |2. |F4 |half |0.315789 |0 |
| |e |2. |F4 |half |0.315789 |0 |
|记 |j |7 |D4 |half |0.315789 |0 |
| |i |7 |D4 |half |0.315789 |0 |
|号 |h |5 |A#3/Bb3 |half |0.315789 |0 |
| |ao |5 |A#3/Bb3 |half |0.315789 |0 |
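The `note_dur` column follows directly from the tempo; restating the computation above as a formula:

$$
\text{note\_dur} = \text{beats} \times \frac{60}{\mathrm{BPM}}, \qquad
\tfrac{1}{2} \times \tfrac{60}{95} = 0.315789\ \mathrm{s}, \qquad
1 \times \tfrac{60}{95} = 0.631578\ \mathrm{s}
$$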
## 2.2 Experiments
<div align = "center">
<table style="width:100%">
<thead>
<tr>
<th> No. </th>
<th width="500"> Description </th>
<th> Synthesized audio (diffsinger_opencpop + pwgan_opencpop) </th>
</tr>
</thead>
<tbody>
<tr>
<td > 1 </td>
<td > 原始 opencpop 标注的 notesnote_dursis_slurs升F大调起始在小字组第3组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test1.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 2 </td>
<td > Original opencpop-annotated notes and is_slurs, with note_durs changed (taken from the sheet music) </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test2.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 3 </td>
<td > Original opencpop-annotated notes with the rest removed (毛 takes a full beat); is_slurs and note_durs changed (taken from the sheet music) </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test3.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 4 </td>
<td > 从谱子获取 notesnote dursis_slurs不含 rest毛字一拍起始在小字一组第3组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test4.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 5 </td>
<td > 从谱子获取 notesnote dursis_slurs加上 rest 毛字半拍rest半拍起始在小字一组第3组</td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test5.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 6 </td>
<td > 从谱子获取 notes is_slurs包含 restnote_durs 从原始标注获取起始在小字一组第3组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test6.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 7 </td>
<td > 从谱子获取 notesnote dursis_slurs不含 rest毛字一拍起始在小字一组第4组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test7.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
</tbody>
</table>
</div>
The experiments above show that extracting the music score this way is feasible. In practice, though, you can **flexibly insert "AP" (a breath) and "SP" (a pause) into the lyrics**, with a corresponding **rest added to the notes**, which makes the synthesized singing sound more natural overall.
Beyond that, choose the key and the starting octave so that **the resulting notes all occur in the training set**; if inference is fed notes never seen in training, the synthesized audio may not have the expected pitch.
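One way to enforce this (an illustration only, not part of the example scripts; `FitToRange`, `train_min`, and `train_max` are hypothetical names): represent the melody as MIDI note numbers and transpose by whole octaves until it fits the pitch range seen in training.
```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: shift a non-empty melody by whole octaves
// (12 semitones in MIDI numbering) until it fits [train_min, train_max].
std::vector<int> FitToRange(std::vector<int> midi_notes,
                            int train_min, int train_max) {
    int lo = *std::min_element(midi_notes.begin(), midi_notes.end());
    int hi = *std::max_element(midi_notes.begin(), midi_notes.end());
    while (hi > train_max && lo - 12 >= train_min) {  // transpose down
        for (int &n : midi_notes) n -= 12;
        lo -= 12; hi -= 12;
    }
    while (lo < train_min && hi + 12 <= train_max) {  // transpose up
        for (int &n : midi_notes) n += 12;
        lo += 12; hi += 12;
    }
    return midi_notes;
}
```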
# 3. Miscellaneous
## 3.1 Reading MIDI
```python
import mido

# Load a MIDI file; ticks_per_beat is the pulses-per-quarter-note resolution.
mid = mido.MidiFile('2093.midi')
print(mid.ticks_per_beat)
```

@ -0,0 +1,98 @@
############################################
# Network Architecture #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: squeezeformer
encoder_conf:
encoder_dim: 256 # dimension of attention
output_size: 256 # dimension of output
attention_heads: 4
num_blocks: 12 # the number of encoder blocks
reduce_idx: 5
recover_idx: 11
feed_forward_expansion_factor: 8
input_dropout_rate: 0.1
feed_forward_dropout_rate: 0.1
attention_dropout_rate: 0.1
adaptive_scale: true
cnn_module_kernel: 31
normalize_before: false
activation_type: 'swish'
pos_enc_layer_type: 'rel_pos'
time_reduction_layer_type: 'stream'
causal: true
use_dynamic_chunk: true
use_dynamic_left_chunk: false
# decoder related
decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1 # sublayer output dropout
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
init_type: 'kaiming_uniform' # !Warning: needed for convergence
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
###########################################
# Dataloader #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1
###########################################
# Training #
###########################################
n_epoch: 240
accum_grad: 1
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
lr: 0.001
weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
warmup_steps: 25000
lr_decay: 1.0
log_interval: 100
checkpoint:
kbest_n: 50
latest_n: 5

@ -0,0 +1,93 @@
############################################
# Network Architecture #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: squeezeformer
encoder_conf:
encoder_dim: 256 # dimension of attention
output_size: 256 # dimension of output
attention_heads: 4
num_blocks: 12 # the number of encoder blocks
reduce_idx: 5
recover_idx: 11
feed_forward_expansion_factor: 8
input_dropout_rate: 0.1
feed_forward_dropout_rate: 0.1
attention_dropout_rate: 0.1
adaptive_scale: true
cnn_module_kernel: 31
normalize_before: false
activation_type: 'swish'
pos_enc_layer_type: 'rel_pos'
time_reduction_layer_type: 'conv1d'
# decoder related
decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
init_type: 'kaiming_uniform' # !Warning: needed for convergence
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
###########################################
# Dataloader #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1
###########################################
# Training #
###########################################
n_epoch: 150
accum_grad: 8
global_grad_clip: 5.0
dist_sampler: False
optim: adam
optim_conf:
lr: 0.002
weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
warmup_steps: 25000
lr_decay: 1.0
log_interval: 100
checkpoint:
kbest_n: 50
latest_n: 5

@ -164,8 +164,8 @@ using the `tar` scripts to unpack the model and then you can use the script to t
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
source path.sh
# If you have process the data and get the manifest file you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
@ -185,14 +185,14 @@ In some situations, you want to use the trained model to do the inference for th
```
you can train the model by yourself using ```bash run.sh --stage 0 --stop_stage 3```, or you can download the pretrained model through the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wav -P data/
```
You need to prepare an audio file or use the audio demo above; please confirm the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/wav2vec2ASR.yaml conf/tuning/decode.yaml exp/wav2vec2ASR/checkpoints/avg_1 data/demo_002_en.wav
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/wav2vec2ASR.yaml conf/tuning/decode.yaml exp/wav2vec2ASR/checkpoints/avg_1 data/demo_01_03.wav
```

@ -0,0 +1,18 @@
# AISHELL
## Version
* paddle version: develop (commit id: daea892c67e85da91906864de40ce9f6f1b893ae)
* paddlespeech version: develop (commit id: c14b4238b256693281e59605abff7c9435b3e2b2)
* paddlenlp version: 2.5.2
## Device
* python: 3.7
* cuda: 10.2
* cudnn: 7.6
## Result
train: Epoch 80, 2*V100-32G, batch size: 5
| Model | Params | Config | Augmentation| Test set | Decode method | WER |
| --- | --- | --- | --- | --- | --- | --- |
| wav2vec2ASR | 324.49 M | conf/wav2vec2ASR.yaml | spec_aug | test-set | greedy search | 5.1009 |

@ -83,7 +83,7 @@ dnn_neurons: 1024
freeze_wav2vec: False
dropout: 0.15
tokenizer: !apply:transformers.BertTokenizer.from_pretrained
tokenizer: !apply:paddlenlp.transformers.AutoTokenizer.from_pretrained
pretrained_model_name_or_path: bert-base-chinese
# bert-base-chinese tokens length
output_neurons: 21128

@ -107,6 +107,7 @@ vocab_filepath: data/lang_char/vocab.txt
###########################################
unit_type: 'char'
tokenizer: bert-base-chinese
mean_std_filepath:
preprocess_config: conf/preprocess.yaml
sortagrad: -1 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
@ -139,12 +140,10 @@ n_epoch: 80
accum_grad: 1
global_grad_clip: 5.0
model_optim: adadelta
model_optim: sgd
model_optim_conf:
lr: 1.0
weight_decay: 0.0
rho: 0.95
epsilon: 1.0e-8
wav2vec2_optim: adam
wav2vec2_optim_conf:
@ -165,3 +164,4 @@ log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5

@ -0,0 +1,168 @@
############################################
# Network Architecture #
############################################
freeze_wav2vec2: False
normalize_wav: True
output_norm: True
init_type: 'kaiming_uniform' # !Warning: needed for convergence
enc:
input_shape: 1024
dnn_blocks: 3
dnn_neurons: 1024
activation: True
normalization: True
dropout_rate: [0.15, 0.15, 0.0]
ctc:
enc_n_units: 1024
blank_id: 0
dropout_rate: 0.0
audio_augment:
speeds: [90, 100, 110]
spec_augment:
time_warp: True
time_warp_window: 5
time_warp_mode: bicubic
freq_mask: True
n_freq_mask: 2
time_mask: True
n_time_mask: 2
replace_with_zero: False
freq_mask_width: 30
time_mask_width: 40
wav2vec2_params_path: exp/wav2vec2/chinese-wav2vec2-large.pdparams
############################################
# Wav2Vec2.0 #
############################################
# vocab_size: 1000000
hidden_size: 1024
num_hidden_layers: 24
num_attention_heads: 16
intermediate_size: 4096
hidden_act: gelu
hidden_dropout: 0.1
activation_dropout: 0.0
attention_dropout: 0.1
feat_proj_dropout: 0.1
feat_quantizer_dropout: 0.0
final_dropout: 0.0
layerdrop: 0.1
initializer_range: 0.02
layer_norm_eps: 1e-5
feat_extract_norm: layer
feat_extract_activation: gelu
conv_dim: [512, 512, 512, 512, 512, 512, 512]
conv_stride: [5, 2, 2, 2, 2, 2, 2]
conv_kernel: [10, 3, 3, 3, 3, 2, 2]
conv_bias: True
num_conv_pos_embeddings: 128
num_conv_pos_embedding_groups: 16
do_stable_layer_norm: True
apply_spec_augment: False
mask_channel_length: 10
mask_channel_min_space: 1
mask_channel_other: 0.0
mask_channel_prob: 0.0
mask_channel_selection: static
mask_feature_length: 10
mask_feature_min_masks: 0
mask_feature_prob: 0.0
mask_time_length: 10
mask_time_min_masks: 2
mask_time_min_space: 1
mask_time_other: 0.0
mask_time_prob: 0.075
mask_time_selection: static
num_codevectors_per_group: 320
num_codevector_groups: 2
contrastive_logits_temperature: 0.1
num_negatives: 100
codevector_dim: 256
proj_codevector_dim: 256
diversity_loss_weight: 0.1
use_weighted_layer_sum: False
# pad_token_id: 0
# bos_token_id: 1
# eos_token_id: 2
add_adapter: False
adapter_kernel_size: 3
adapter_stride: 2
num_adapter_layers: 3
output_hidden_size: None
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
vocab_filepath: data/lang_char/vocab.txt
###########################################
# Dataloader #
###########################################
unit_type: 'char'
tokenizer: bert-base-chinese
mean_std_filepath:
preprocess_config: conf/preprocess.yaml
sortagrad: -1 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 5 # Different batch_size may cause large differences in results
maxlen_in: 51200000000 # if input length > maxlen-in batchsize is automatically reduced
maxlen_out: 1500000 # if output length > maxlen-out batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 6
subsampling_factor: 1
num_encs: 1
dist_sampler: True
shortest_first: True
return_lens_rate: True
###########################################
# use speechbrain dataloader #
###########################################
use_sb_pipeline: True # whether use speechbrain pipeline. Default is True.
sb_pipeline_conf: conf/train_with_wav2vec.yaml
###########################################
# Training #
###########################################
n_epoch: 80
accum_grad: 1
global_grad_clip: 5.0
model_optim: adadelta
model_optim_conf:
lr: 1.0
weight_decay: 0.0
rho: 0.95
epsilon: 1.0e-8
wav2vec2_optim: adam
wav2vec2_optim_conf:
lr: 0.0001
weight_decay: 0.0
model_scheduler: newbobscheduler
model_scheduler_conf:
improvement_threshold: 0.0025
annealing_factor: 0.8
patient: 0
wav2vec2_scheduler: newbobscheduler
wav2vec2_scheduler_conf:
improvement_threshold: 0.0025
annealing_factor: 0.9
patient: 0
log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5

@ -21,7 +21,7 @@ import glob
import logging
import os
from paddlespeech.s2t.models.wav2vec2.io.dataio import read_audio
from paddlespeech.s2t.io.speechbrain.dataio import read_audio
logger = logging.getLogger(__name__)

@ -1,7 +1,7 @@
#!/bin/bash
stage=-1
stop_stage=-1
stop_stage=3
dict_dir=data/lang_char
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;

@ -8,9 +8,7 @@ echo "using $ngpu gpus..."
expdir=exp
datadir=data
train_set=train_960
recog_set="test-clean test-other dev-clean dev-other"
recog_set="test-clean"
train_set=train
config_path=$1
decode_config_path=$2
@ -75,7 +73,7 @@ for type in ctc_prefix_beam_search; do
--trans_hyp ${ckpt_prefix}.${type}.rsl.text
python3 utils/compute-wer.py --char=1 --v=1 \
data/manifest.test-clean.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
data/manifest.test.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
echo "decoding ${type} done."
done

@ -14,7 +14,7 @@ ckpt_prefix=$3
audio_file=$4
mkdir -p data
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wav -P data/
if [ $? -ne 0 ]; then
exit 1
fi

@ -15,11 +15,11 @@ resume= # xx e.g. 30
export FLAGS_cudnn_deterministic=1
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
audio_file=data/demo_002_en.wav
audio_file=data/demo_01_03.wav
avg_ckpt=avg_${avg_num}
ckpt=$(basename ${conf_path} | awk -F'.' '{print $1}')
echo "checkpoint name ${ckpt}"git revert -v
echo "checkpoint name ${ckpt}"
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data

@ -43,10 +43,7 @@ fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_aishell3
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_aishell3

@ -46,10 +46,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
../../csmsc/tts3/local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_canton
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
# ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc

@ -45,10 +45,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx speedyspeech_csmsc
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc

@ -45,10 +45,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc

@ -58,10 +58,7 @@ fi
# paddle2onnx non streaming
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc
@ -77,10 +74,7 @@ fi
# paddle2onnx streaming
if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
# streaming acoustic model
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_encoder_infer
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_decoder

@ -0,0 +1 @@
../../tts3/local/paddle2onnx.sh

@ -39,3 +39,31 @@ fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} ${add_blank}|| exit -1
fi
# # not ready yet for operator missing in Paddle2ONNX
# # paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
# # we have only tested the following models so far
# if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# # install paddle2onnx
# pip install paddle2onnx --upgrade
# ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx vits_csmsc
# fi
# # inference with onnxruntime
# if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# ./local/ort_predict.sh ${train_output_path}
# fi
# # not ready yet for operator missing in Paddle-Lite
# # must run after stage 3 (which stage generated static models)
# if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# # NOTE by yuantian 2022.11.21: please compile develop version of Paddle-Lite to export and run TTS models,
# # cause TTS models are supported by https://github.com/PaddlePaddle/Paddle-Lite/pull/9587
# # and https://github.com/PaddlePaddle/Paddle-Lite/pull/9706
# ./local/export2lite.sh ${train_output_path} inference pdlite vits_csmsc x86
# fi
# if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
# CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
# fi

@ -6,7 +6,7 @@ set -e
gpus=0
stage=0
stop_stage=0
stop_stage=4
conf_path=conf/wav2vec2ASR.yaml
ips= #xx.xx.xx.xx,xx.xx.xx.xx
decode_conf_path=conf/tuning/decode.yaml

@ -45,10 +45,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_ljspeech
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_ljspeech

@ -0,0 +1,6 @@
# Opencpop
* svs1 - DiffSinger
* voc1 - Parallel WaveGAN
* voc5 - HiFiGAN

@ -0,0 +1,276 @@
([简体中文](./README_cn.md)|English)
# DiffSinger with Opencpop
This example contains code used to train a [DiffSinger](https://arxiv.org/abs/2105.02446) model with [Mandarin singing corpus](https://wenet.org.cn/opencpop/).
## Dataset
### Download and Extract
Download Opencpop from its [Official Website](https://wenet.org.cn/opencpop/download/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/Opencpop`.
## Get Started
Assume the path to the dataset is `~/datasets/Opencpop`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
- (in progress) synthesize waveform from a text file.
5. (in progress) inference using the static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── energy_stats.npy
├── norm
├── pitch_stats.npy
├── raw
├── speech_stats.npy
└── speech_stretchs.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. `speech_stretchs.npy` contains the minimum and maximum values of each dimension of the mel spectrum, which is used for linear stretching before training/inference of the diffusion module.
Note: since training on un-normalized features works better than on normalized ones, the features saved under `norm` are in fact the un-normalized features.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains utterance id, speaker id, phones, text lengths, speech lengths, phone durations, the paths of the speech, pitch, and energy features, notes, note durations, and slur flags.
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--phones-dict PHONES_DICT]
[--speaker-dict SPEAKER_DICT] [--speech-stretchs SPEECH_STRETCHS]
Train a DiffSinger model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG diffsinger config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu=0, use cpu.
--phones-dict PHONES_DICT
phone vocabulary file.
--speaker-dict SPEAKER_DICT
speaker id map file for multiple speaker model.
--speech-stretchs SPEECH_STRETCHS
min and max mel for stretching.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--speech-stretchs` is the path of the file containing the min and max values of the mel spectrum.
### Synthesizing
We use parallel wavegan as the neural vocoder.
Download pretrained parallel wavegan model from [pwgan_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip) and unzip it.
```bash
unzip pwgan_opencpop_ckpt_1.4.0.zip
```
Parallel WaveGAN checkpoint contains files listed below.
```text
pwgan_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_100000.pdz # model parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
[--am {diffsinger_opencpop}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--voc {pwgan_opencpop}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--tones_dict TONES_DICT
tone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata.
--output_dir OUTPUT_DIR
output dir.
--speech-stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
`local/pinyin_to_phone.txt` comes from the readme of the opencpop dataset, indicating the mapping from pinyin to phonemes in opencpop.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
[--text TEXT] [--output_dir OUTPUT_DIR]
[--pinyin_phone PINYIN_PHONE]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG {zh, en, mix, canton} Choose language type of tts task.
{sing} Choose language type of svs task.
--inference_dir INFERENCE_DIR
dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize file, a 'utt_id sentence' pair per line for tts task.
A '{ utt_id input_type (is word) text notes note_durs}' or '{utt_id input_type (is phoneme) phones notes note_durs is_slurs}' pair per line for svs task.
--output_dir OUTPUT_DIR
output dir.
--pinyin_phone PINYIN_PHONE
pinyin to phone map file, using on sing_frontend.
--speech_stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
1. `--am` is the acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the diffsinger pretrained model.
3. `--voc` is the vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the language. `zh`, `en`, `mix` and `canton` are for the tts task; `sing` is for the svs task.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
10. `--inference_dir` is the directory to save static models. If this option is not added, static models will not be generated or saved.
11. `--pinyin_phone` is the pinyin-to-phone map file used by sing_frontend.
12. `--speech_stretchs` is the file with the min and max values of the mel spectrum, used by the diffusion module of diffsinger.
Note: at present, the diffsinger model does not support dynamic-to-static conversion, so do not add `--inference_dir`.
## Pretrained Model
Pretrained DiffSinger model:
- [diffsinger_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/diffsinger_opencpop_ckpt_1.4.0.zip)
DiffSinger checkpoint contains files listed below.
```text
diffsinger_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train diffsinger
├── energy_stats.npy # statistics used to normalize energy when training diffsinger if norm is needed
├── phone_id_map.txt # phone vocabulary file when training diffsinger
├── pinyin_to_phone.txt # pinyin-to-phoneme mapping file when training diffsinger
├── pitch_stats.npy # statistics used to normalize pitch when training diffsinger if norm is needed
├── snapshot_iter_160000.pdz # model parameters of diffsinger
├── speech_stats.npy # statistics used to normalize mel when training diffsinger if norm is needed
└── speech_stretchs.npy # min and max values to use for mel spectral stretching before training diffusion
```
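For intuition, here is a minimal sketch of how these stretch statistics can be applied. It assumes `speech_stretchs.npy` stacks the per-dimension mel minimum (row 0) and maximum (row 1), and that the diffusion module works on mels linearly stretched into [-1, 1]; the target range and file layout are assumptions, not guaranteed by this README.
```python
import numpy as np

# Assumption: row 0 holds the per-dimension mel minimum, row 1 the maximum.
stretchs = np.load("diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy")
mel_min, mel_max = stretchs[0], stretchs[1]  # each of shape (n_mels,)

def stretch(mel: np.ndarray) -> np.ndarray:
    """Linearly map a (T, n_mels) mel spectrogram into [-1, 1]."""
    return (mel - mel_min) / (mel_max - mel_min) * 2.0 - 1.0

def unstretch(mel: np.ndarray) -> np.ndarray:
    """Invert the stretch on the denoised mel before vocoding."""
    return (mel + 1.0) / 2.0 * (mel_max - mel_min) + mel_min
```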
You can use the following scripts to synthesize for `${BIN_DIR}/../sentences_sing.txt` using pretrained diffsinger and parallel wavegan models.
```bash
source path.sh
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=diffsinger_opencpop_ckpt_1.4.0/default.yaml \
--am_ckpt=diffsinger_opencpop_ckpt_1.4.0/snapshot_iter_160000.pdz \
--am_stat=diffsinger_opencpop_ckpt_1.4.0/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=exp/default/test_e2e \
--phones_dict=diffsinger_opencpop_ckpt_1.4.0/phone_id_map.txt \
--pinyin_phone=diffsinger_opencpop_ckpt_1.4.0/pinyin_to_phone.txt \
--speech_stretchs=diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy
```

@ -0,0 +1,280 @@
(Simplified Chinese|[English](./README.md))
# Train DiffSinger with the Opencpop Dataset
This example contains code used to train a [DiffSinger](https://arxiv.org/abs/2105.02446) model with the [Mandarin singing corpus](https://wenet.org.cn/opencpop/).
## Dataset
### Download and Extract
Download the dataset from the [official website](https://wenet.org.cn/opencpop/download/).
## Get Started
Assume the path to the dataset is `~/datasets/Opencpop`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - (in progress) synthesize waveform from a text file.
5. (in progress) inference using a static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── energy_stats.npy
├── norm
├── pitch_stats.npy
├── raw
├── speech_stats.npy
└── speech_stretchs.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the speech, pitch, and energy features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and stored in `dump/train/*_stats.npy`. `speech_stretchs.npy` contains the per-dimension minimum and maximum values of the mel spectrogram, used for linear stretching before training/inference of the diffusion module.
Note: because training on un-normalized features turns out to work better than on normalized ones, the features saved under `norm` are actually the un-normalized features.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains utterance id, speaker id, phones, text length, speech length, phone durations, the path of the speech feature, the path of the pitch feature, the path of the energy feature, notes, note durations, and whether each phone is a slur.
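For illustration, here is a hedged sketch of how the `*_stats.npy` files can be consumed, assuming each file stacks the training-set mean (row 0) and standard deviation (row 1) of the feature, one value per dimension, as produced by the preprocessing stage:
```python
import numpy as np

# Assumption: *_stats.npy stacks the training-set mean (row 0) and
# standard deviation (row 1) of the feature, one value per dimension.
stats = np.load("dump/train/pitch_stats.npy")
mean, std = stats[0], stats[1]

def normalize(feat: np.ndarray) -> np.ndarray:
    """Z-score a (T, dim) feature with the training-set statistics."""
    return (feat - mean) / std

def denormalize(feat: np.ndarray) -> np.ndarray:
    """Restore the original scale, e.g. after model prediction."""
    return feat * std + mean
```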
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--phones-dict PHONES_DICT]
[--speaker-dict SPEAKER_DICT] [--speech-stretchs SPEECH_STRETCHS]
Train a DiffSinger model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG diffsinger config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu=0, use cpu.
--phones-dict PHONES_DICT
phone vocabulary file.
--speaker-dict SPEAKER_DICT
speaker id map file for multiple speaker model.
--speech-stretchs SPEECH_STRETCHS
min and max mel for stretching.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the normalized subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--speech-stretchs` is the path of the file containing the minimum and maximum values of the mel spectrogram.
### Synthesizing
We use parallel wavegan as the neural vocoder.
Download the pretrained parallel wavegan model from [pwgan_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip) and unzip it.
```bash
unzip pwgan_opencpop_ckpt_1.4.0.zip
```
Parallel WaveGAN checkpoint contains files listed below.
```text
pwgan_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_100000.pdz # model parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
[--am {diffsinger_opencpop}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--voc {pwgan_opencpop}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--tones_dict TONES_DICT
tone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata.
--output_dir OUTPUT_DIR
output dir.
--speech-stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
`local/pinyin_to_phone.txt` comes from the README of the Opencpop dataset and records the pinyin-to-phoneme mapping used by Opencpop.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
[--text TEXT] [--output_dir OUTPUT_DIR]
[--pinyin_phone PINYIN_PHONE]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG {zh, en, mix, canton} Choose language type of tts task.
{sing} Choose language type of svs task.
--inference_dir INFERENCE_DIR
dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize file, a 'utt_id sentence' pair per line for tts task.
A '{ utt_id input_type (is word) text notes note_durs}' or '{utt_id input_type (is phoneme) phones notes note_durs is_slurs}' pair per line for svs task.
--output_dir OUTPUT_DIR
output dir.
--pinyin_phone PINYIN_PHONE
pinyin to phone map file, using on sing_frontend.
--speech_stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
1. `--am` is the acoustic model type with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the diffsinger pretrained model.
3. `--voc` is the vocoder type with the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the language of the model: `zh`, `en`, `mix` or `canton` for the tts task, and `sing` for the svs task.
6. `--test_metadata` should be the normalized metadata file under `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
10. `--inference_dir` is the directory to save static models. If this argument is omitted, no static model will be exported or saved.
11. `--pinyin_phone` is the pinyin-to-phone mapping file.
12. `--speech_stretchs` gives the min and max values of the mel spectrogram, used for linear stretching before diffusion in diffsinger.
Note: at present, the diffsinger model does not support dynamic-to-static conversion, so do not add `--inference_dir`.
## Pretrained Model
Pretrained DiffSinger model:
- [diffsinger_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/diffsinger_opencpop_ckpt_1.4.0.zip)
DiffSinger checkpoint contains files listed below.
```text
diffsinger_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train diffsinger
├── energy_stats.npy # statistics used to normalize energy when training diffsinger if norm is needed
├── phone_id_map.txt # phone vocabulary file when training diffsinger
├── pinyin_to_phone.txt # pinyin-to-phoneme mapping file when training diffsinger
├── pitch_stats.npy # statistics used to normalize pitch when training diffsinger if norm is needed
├── snapshot_iter_160000.pdz # model parameters and optimizer states of diffsinger
├── speech_stats.npy # statistics used to normalize mel when training diffsinger if norm is needed
└── speech_stretchs.npy # min and max values used for mel spectral stretching before training diffusion
```
You can use the following scripts to synthesize for `${BIN_DIR}/../sentences_sing.txt` using pretrained diffsinger and parallel wavegan models.
```bash
source path.sh
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=diffsinger_opencpop_ckpt_1.4.0/default.yaml \
--am_ckpt=diffsinger_opencpop_ckpt_1.4.0/snapshot_iter_160000.pdz \
--am_stat=diffsinger_opencpop_ckpt_1.4.0/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=exp/default/test_e2e \
--phones_dict=diffsinger_opencpop_ckpt_1.4.0/phone_id_map.txt \
--pinyin_phone=diffsinger_opencpop_ckpt_1.4.0/pinyin_to_phone.txt \
--speech_stretchs=diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy
```

@ -0,0 +1,159 @@
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 30 # Minimum frequency of Mel basis.
fmax: 12000 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
# Only used for the model using pitch features (e.g. FastSpeech2)
f0min: 80 # Minimum f0 for pitch extraction.
f0max: 750 # Maximum f0 for pitch extraction.
###########################################################
# DATA SETTING #
###########################################################
batch_size: 48 # batch size
num_workers: 1 # number of workers in DataLoader
###########################################################
# MODEL SETTING #
###########################################################
model:
# music score related
note_num: 300 # number of note
is_slur_num: 2 # number of slur
# fastspeech2 module options
use_energy_pred: False # whether use energy predictor
use_postnet: False # whether use postnet
# fastspeech2 module
fastspeech2_params:
adim: 256 # attention dimension
aheads: 2 # number of attention heads
elayers: 4 # number of encoder layers
eunits: 1024 # number of encoder ff units
dlayers: 4 # number of decoder layers
dunits: 1024 # number of decoder ff units
positionwise_layer_type: conv1d-linear # type of position-wise layer
positionwise_conv_kernel_size: 9 # kernel size of position wise conv layer
transformer_enc_dropout_rate: 0.1 # dropout rate for transformer encoder layer
transformer_enc_positional_dropout_rate: 0.1 # dropout rate for transformer encoder positional encoding
transformer_enc_attn_dropout_rate: 0.0 # dropout rate for transformer encoder attention layer
transformer_activation_type: "gelu" # Activation function type in transformer.
encoder_normalize_before: True # whether to perform layer normalization before the input
decoder_normalize_before: True # whether to perform layer normalization before the input
reduction_factor: 1 # reduction factor
init_type: xavier_uniform # initialization type
init_enc_alpha: 1.0 # initial value of alpha of encoder scaled position encoding
init_dec_alpha: 1.0 # initial value of alpha of decoder scaled position encoding
use_scaled_pos_enc: True # whether to use scaled positional encoding
transformer_dec_dropout_rate: 0.1 # dropout rate for transformer decoder layer
transformer_dec_positional_dropout_rate: 0.1 # dropout rate for transformer decoder positional encoding
transformer_dec_attn_dropout_rate: 0.0 # dropout rate for transformer decoder attention layer
duration_predictor_layers: 5 # number of layers of duration predictor
duration_predictor_chans: 256 # number of channels of duration predictor
duration_predictor_kernel_size: 3 # filter size of duration predictor
duration_predictor_dropout_rate: 0.5 # dropout rate in duration predictor
pitch_predictor_layers: 5 # number of conv layers in pitch predictor
pitch_predictor_chans: 256 # number of channels of conv layers in pitch predictor
pitch_predictor_kernel_size: 5 # kernel size of conv layers in pitch predictor
pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
energy_predictor_layers: 2 # number of conv layers in energy predictor
energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
energy_predictor_dropout: 0.5 # dropout rate in energy predictor
energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
postnet_layers: 5 # number of layers of postnet
postnet_filts: 5 # filter size of conv layers in postnet
postnet_chans: 256 # number of channels of conv layers in postnet
postnet_dropout_rate: 0.5 # dropout rate for postnet
# denoiser module
denoiser_params:
in_channels: 80 # Number of channels of the input mel-spectrogram
out_channels: 80 # Number of channels of the output mel-spectrogram
kernel_size: 3 # Kernel size of the residual blocks inside
layers: 20 # Number of residual blocks inside
stacks: 5 # The number of groups to split the residual blocks into
residual_channels: 256 # Residual channel of the residual blocks
gate_channels: 512 # Gate channel of the residual blocks
skip_channels: 256 # Skip channel of the residual blocks
aux_channels: 256 # Auxiliary channel of the residual blocks
dropout: 0.1 # Dropout of the residual blocks
bias: True # Whether to use bias in residual blocks
use_weight_norm: False # Whether to use weight norm in all convolutions
init_type: "kaiming_normal" # Type of initialize weights of a neural network module
diffusion_params:
num_train_timesteps: 100 # The number of timesteps between the real sample and pure noise during training
beta_start: 0.0001 # beta start parameter for the scheduler
beta_end: 0.06 # beta end parameter for the scheduler
beta_schedule: "linear" # beta schedule parameter for the scheduler
num_max_timesteps: 100 # The max timestep transition from real to noise
stretch: True # whether to stretch before diffusion
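# Note: with a "linear" beta_schedule, the noise variances ramp linearly
# from beta_start to beta_end over num_train_timesteps diffusion steps.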
###########################################################
# UPDATER SETTING #
###########################################################
fs2_updater:
use_masking: True # whether to apply masking for padded part in loss calculation
ds_updater:
use_masking: True # whether to apply masking for padded part in loss calculation
###########################################################
# OPTIMIZER SETTING #
###########################################################
# fastspeech2 optimizer
fs2_optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
# diffusion optimizer
ds_optimizer_params:
beta1: 0.9
beta2: 0.98
weight_decay: 0.0
ds_scheduler_params:
learning_rate: 0.001
gamma: 0.5
step_size: 50000
ds_grad_norm: 1
###########################################################
# INTERVAL SETTING #
###########################################################
only_train_diffusion: True # Whether to freeze fastspeech2 parameters when training diffusion
ds_train_start_steps: 160000 # Number of steps to start to train diffusion module.
train_max_steps: 320000 # Number of training steps.
save_interval_steps: 2000 # Interval steps to save checkpoint.
eval_interval_steps: 2000 # Interval steps to evaluate the network.
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 10086

@ -0,0 +1,418 @@
a|a
ai|ai
an|an
ang|ang
ao|ao
ba|b a
bai|b ai
ban|b an
bang|b ang
bao|b ao
bei|b ei
ben|b en
beng|b eng
bi|b i
bian|b ian
biao|b iao
bie|b ie
bin|b in
bing|b ing
bo|b o
bu|b u
ca|c a
cai|c ai
can|c an
cang|c ang
cao|c ao
ce|c e
cei|c ei
cen|c en
ceng|c eng
cha|ch a
chai|ch ai
chan|ch an
chang|ch ang
chao|ch ao
che|ch e
chen|ch en
cheng|ch eng
chi|ch i
chong|ch ong
chou|ch ou
chu|ch u
chua|ch ua
chuai|ch uai
chuan|ch uan
chuang|ch uang
chui|ch ui
chun|ch un
chuo|ch uo
ci|c i
cong|c ong
cou|c ou
cu|c u
cuan|c uan
cui|c ui
cun|c un
cuo|c uo
da|d a
dai|d ai
dan|d an
dang|d ang
dao|d ao
de|d e
dei|d ei
den|d en
deng|d eng
di|d i
dia|d ia
dian|d ian
diao|d iao
die|d ie
ding|d ing
diu|d iu
dong|d ong
dou|d ou
du|d u
duan|d uan
dui|d ui
dun|d un
duo|d uo
e|e
ei|ei
en|en
eng|eng
er|er
fa|f a
fan|f an
fang|f ang
fei|f ei
fen|f en
feng|f eng
fo|f o
fou|f ou
fu|f u
ga|g a
gai|g ai
gan|g an
gang|g ang
gao|g ao
ge|g e
gei|g ei
gen|g en
geng|g eng
gong|g ong
gou|g ou
gu|g u
gua|g ua
guai|g uai
guan|g uan
guang|g uang
gui|g ui
gun|g un
guo|g uo
ha|h a
hai|h ai
han|h an
hang|h ang
hao|h ao
he|h e
hei|h ei
hen|h en
heng|h eng
hm|h m
hng|h ng
hong|h ong
hou|h ou
hu|h u
hua|h ua
huai|h uai
huan|h uan
huang|h uang
hui|h ui
hun|h un
huo|h uo
ji|j i
jia|j ia
jian|j ian
jiang|j iang
jiao|j iao
jie|j ie
jin|j in
jing|j ing
jiong|j iong
jiu|j iu
ju|j v
juan|j van
jue|j ve
jun|j vn
ka|k a
kai|k ai
kan|k an
kang|k ang
kao|k ao
ke|k e
kei|k ei
ken|k en
keng|k eng
kong|k ong
kou|k ou
ku|k u
kua|k ua
kuai|k uai
kuan|k uan
kuang|k uang
kui|k ui
kun|k un
kuo|k uo
la|l a
lai|l ai
lan|l an
lang|l ang
lao|l ao
le|l e
lei|l ei
leng|l eng
li|l i
lia|l ia
lian|l ian
liang|l iang
liao|l iao
lie|l ie
lin|l in
ling|l ing
liu|l iu
lo|l o
long|l ong
lou|l ou
lu|l u
luan|l uan
lun|l un
luo|l uo
lv|l v
lve|l ve
m|m
ma|m a
mai|m ai
man|m an
mang|m ang
mao|m ao
me|m e
mei|m ei
men|m en
meng|m eng
mi|m i
mian|m ian
miao|m iao
mie|m ie
min|m in
ming|m ing
miu|m iu
mo|m o
mou|m ou
mu|m u
n|n
na|n a
nai|n ai
nan|n an
nang|n ang
nao|n ao
ne|n e
nei|n ei
nen|n en
neng|n eng
ng|n g
ni|n i
nian|n ian
niang|n iang
niao|n iao
nie|n ie
nin|n in
ning|n ing
niu|n iu
nong|n ong
nou|n ou
nu|n u
nuan|n uan
nun|n un
nuo|n uo
nv|n v
nve|n ve
o|o
ou|ou
pa|p a
pai|p ai
pan|p an
pang|p ang
pao|p ao
pei|p ei
pen|p en
peng|p eng
pi|p i
pian|p ian
piao|p iao
pie|p ie
pin|p in
ping|p ing
po|p o
pou|p ou
pu|p u
qi|q i
qia|q ia
qian|q ian
qiang|q iang
qiao|q iao
qie|q ie
qin|q in
qing|q ing
qiong|q iong
qiu|q iu
qu|q v
quan|q van
que|q ve
qun|q vn
ran|r an
rang|r ang
rao|r ao
re|r e
ren|r en
reng|r eng
ri|r i
rong|r ong
rou|r ou
ru|r u
rua|r ua
ruan|r uan
rui|r ui
run|r un
ruo|r uo
sa|s a
sai|s ai
san|s an
sang|s ang
sao|s ao
se|s e
sen|s en
seng|s eng
sha|sh a
shai|sh ai
shan|sh an
shang|sh ang
shao|sh ao
she|sh e
shei|sh ei
shen|sh en
sheng|sh eng
shi|sh i
shou|sh ou
shu|sh u
shua|sh ua
shuai|sh uai
shuan|sh uan
shuang|sh uang
shui|sh ui
shun|sh un
shuo|sh uo
si|s i
song|s ong
sou|s ou
su|s u
suan|s uan
sui|s ui
sun|s un
suo|s uo
ta|t a
tai|t ai
tan|t an
tang|t ang
tao|t ao
te|t e
tei|t ei
teng|t eng
ti|t i
tian|t ian
tiao|t iao
tie|t ie
ting|t ing
tong|t ong
tou|t ou
tu|t u
tuan|t uan
tui|t ui
tun|t un
tuo|t uo
wa|w a
wai|w ai
wan|w an
wang|w ang
wei|w ei
wen|w en
weng|w eng
wo|w o
wu|w u
xi|x i
xia|x ia
xian|x ian
xiang|x iang
xiao|x iao
xie|x ie
xin|x in
xing|x ing
xiong|x iong
xiu|x iu
xu|x v
xuan|x van
xue|x ve
xun|x vn
ya|y a
yan|y an
yang|y ang
yao|y ao
ye|y e
yi|y i
yin|y in
ying|y ing
yo|y o
yong|y ong
you|y ou
yu|y v
yuan|y van
yue|y ve
yun|y vn
za|z a
zai|z ai
zan|z an
zang|z ang
zao|z ao
ze|z e
zei|z ei
zen|z en
zeng|z eng
zha|zh a
zhai|zh ai
zhan|zh an
zhang|zh ang
zhao|zh ao
zhe|zh e
zhei|zh ei
zhen|zh en
zheng|zh eng
zhi|zh i
zhong|zh ong
zhou|zh ou
zhu|zh u
zhua|zh ua
zhuai|zh uai
zhuan|zh uan
zhuang|zh uang
zhui|zh ui
zhun|zh un
zhuo|zh uo
zi|z i
zong|z ong
zou|z ou
zu|z u
zuan|z uan
zui|z ui
zun|z un
zuo|z uo

@ -0,0 +1,74 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=opencpop \
--rootdir=~/datasets/Opencpop/segments \
--dumpdir=dump \
--label-file=~/datasets/Opencpop/segments/transcriptions.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="pitch"
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="energy"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and convert phone/speaker to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--pitch-stats=dump/train/pitch_stats.npy \
--energy-stats=dump/train/energy_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--pitch-stats=dump/train/pitch_stats.npy \
--energy-stats=dump/train/energy_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--pitch-stats=dump/train/pitch_stats.npy \
--energy-stats=dump/train/energy_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# Get feature(mel) extremum for diffusion stretch
echo "Get feature(mel) extremum ..."
python3 ${BIN_DIR}/get_minmax.py \
--metadata=dump/train/norm/metadata.jsonl \
--speech-stretchs=dump/train/speech_stretchs.npy
fi

@ -0,0 +1,27 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=diffsinger_opencpop \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speech_stretchs=dump/train/speech_stretchs.npy
fi

@ -0,0 +1,53 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speech_stretchs=dump/train/speech_stretchs.npy \
--pinyin_phone=local/pinyin_to_phone.txt
fi
# for more GAN Vocoders
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_opencpop \
--voc_config=hifigan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=hifigan_opencpop_ckpt_1.4.0/snapshot_iter_625000.pdz \
--voc_stat=hifigan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speech_stretchs=dump/train/speech_stretchs.npy \
--pinyin_phone=local/pinyin_to_phone.txt
fi

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1 \
--phones-dict=dump/phone_id_map.txt \
--speech-stretchs=dump/train/speech_stretchs.npy

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=diffsinger
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}

@ -0,0 +1,37 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_320000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize, vocoder is pwgan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is pwgan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -0,0 +1,139 @@
# Parallel WaveGAN with Opencpop
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [Mandarin singing corpus](https://wenet.org.cn/opencpop/).
## Dataset
### Download and Extract
Download Opencpop from its [Official Website](https://wenet.org.cn/opencpop/download/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/Opencpop`.
## Get Started
Assume the path to the dataset is `~/datasets/Opencpop`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the path to the spectrogram of each utterance.
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG ParallelWaveGAN config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
running benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the parallel wavegan config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here:
- [pwgan_opencpop_ckpt_1.4.0](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip)
Parallel WaveGAN checkpoint contains files listed below.
```text
pwgan_opencpop_ckpt_1.4.0
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_100000.pdz # generator parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
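For reference, a direct invocation with this checkpoint might look like the sketch below; the flag names come from the help message above, while the concrete paths (the unzipped checkpoint in the current directory, `dump` produced by the preprocessing stage) are assumptions.
```bash
source path.sh
python3 ${BIN_DIR}/../synthesize.py \
    --generator-type=pwgan \
    --config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
    --checkpoint=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
    --test-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default/test \
    --ngpu=1
```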
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,119 @@
# This is the hyperparameter configuration file for Parallel WaveGAN.
# Please make sure this is adjusted for the Opencpop dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration requires 12 GB GPU memory and takes ~3 days on RTX TITAN.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 30 # Minimum freq in mel basis calculation. (Hz)
fmax: 12000 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of dilated convolution.
layers: 30 # Number of residual block layers.
stacks: 3 # Number of stacks i.e., dilation cycles.
residual_channels: 64 # Number of channels in residual conv.
gate_channels: 128 # Number of channels in gated conv.
skip_channels: 64 # Number of channels in skip conv.
aux_channels: 80 # Number of channels for auxiliary feature conv.
# Must be the same as num_mels.
aux_context_window: 2 # Context window size for auxiliary feature.
# If set to 2, previous 2 and future 2 frames will be considered.
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
bias: True # use bias in residual blocks
use_weight_norm: True # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
use_causal_conv: False # use causal conv in residual blocks and upsample layers
upsample_scales: [8, 4, 2, 2] # Upsampling scales. Product of these must be the same as hop size.
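# Here 8 * 4 * 2 * 2 = 128, matching the n_shift (hop size) above.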
interpolate_mode: "nearest" # upsample net interpolate mode
freq_axis_kernel_size: 1 # upsampling net: convolution kernel size in frequency axis
nonlinear_activation: null
nonlinear_activation_params: {}
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of conv layers.
layers: 10 # Number of conv layers.
conv_channels: 64 # Number of channels of conv layers.
bias: True # Whether to use bias parameter in conv.
use_weight_norm: True # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in leakyrelu.
###########################################################
# STFT LOSS SETTING #
###########################################################
stft_loss_params:
fft_sizes: [1024, 2048, 512] # List of FFT size for STFT-based loss.
hop_sizes: [120, 240, 50] # List of hop size for STFT-based loss
win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
window: "hann" # Window function for STFT-based loss
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_adv: 4.0 # Loss balancing coefficient.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 8 # Batch size.
batch_max_steps: 25500 # Length of each audio in batch. Make sure divisible by n_shift.
num_workers: 1 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
epsilon: 1.0e-6 # Generator's epsilon.
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 0.0001 # Generator's learning rate.
step_size: 200000 # Generator's scheduler step size.
gamma: 0.5 # Generator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10 # Generator's gradient norm.
discriminator_optimizer_params:
epsilon: 1.0e-6 # Discriminator's epsilon.
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 0.00005 # Discriminator's learning rate.
step_size: 200000 # Discriminator's scheduler step size.
gamma: 0.5 # Discriminator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
train_max_steps: 400000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_save_intermediate_results: 4 # Number of results to be saved as intermediate results.
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1 @@
../../../csmsc/voc1/local/PTQ_static.sh

@ -0,0 +1,15 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../../dygraph_to_static.py \
--type=voc \
--voc=pwgan_opencpop \
--voc_config=${config_path} \
--voc_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--voc_stat=dump/train/feats_stats.npy \
--inference_dir=exp/default/inference/

@ -0,0 +1,47 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/Opencpop/segments/ \
--dataset=opencpop \
--dumpdir=dump \
--dur-file=~/datasets/Opencpop/segments/transcriptions.txt \
--config=${config_path} \
--cut-sil=False \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1 @@
../../../csmsc/voc1/local/synthesize.sh

@ -0,0 +1 @@
../../../csmsc/voc1/local/train.sh

@ -0,0 +1 @@
../../csmsc/voc1/path.sh

@ -0,0 +1,42 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_100000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# dygraph to static
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/dygraph_to_static.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# PTQ_static
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/PTQ_static.sh ${train_output_path} pwgan_opencpop || exit -1
fi

@ -0,0 +1,167 @@
# This is the configuration file for the Opencpop dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 12000 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [8, 4, 2, 2] # Upsampling scales.
upsample_kernel_sizes: [16, 8, 4, 4] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer."
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 512
hop_size: 128
win_length: 512
window: "hann"
num_mels: 80
fmin: 30
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss..
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure divisible by hop_size.
num_workers: 1 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 4 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,168 @@
# This is the configuration file for the Opencpop dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 12000 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [8, 4, 2, 2] # Upsampling scales.
upsample_kernel_sizes: [16, 8, 4, 4] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer."
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 512
hop_size: 128
win_length: 512
window: "hann"
num_mels: 80
fmin: 30
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss..
###########################################################
# DATA LOADER SETTING #
###########################################################
#batch_size: 16 # Batch size.
batch_size: 1 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure divisible by hop_size.
num_workers: 1 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
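Both schedules are piecewise constant: the learning rate is multiplied by gamma each time training passes a milestone. A pure-Python sketch of the effective rate (the exact at-milestone boundary behavior is an assumption; schedulers differ on >= versus >):

def effective_lr(step, base_lr=2.0e-4, gamma=0.5,
                 milestones=(200_000, 400_000, 600_000, 800_000)):
    # Count milestones already passed and halve the base rate once per milestone.
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed

# effective_lr(0) == 2.0e-4; effective_lr(250_000) == 1.0e-4;
# after all four milestones, effective_lr(2_600_000) == 1.25e-5.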
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2600000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 4 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,74 @@
#!/bin/bash
source path.sh
gpus=0
stage=0
stop_stage=100
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${MAIN_ROOT}/paddlespeech/t2s/exps/diffsinger/gen_gta_mel.py \
--diffsinger-config=diffsinger_opencpop_ckpt_1.4.0/default.yaml \
--diffsinger-checkpoint=diffsinger_opencpop_ckpt_1.4.0/snapshot_iter_160000.pdz \
--diffsinger-stat=diffsinger_opencpop_ckpt_1.4.0/speech_stats.npy \
--diffsinger-stretch=diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy \
--dur-file=~/datasets/Opencpop/segments/transcriptions.txt \
--output-dir=dump_finetune \
--phones-dict=diffsinger_opencpop_ckpt_1.4.0/phone_id_map.txt \
--dataset=opencpop \
--rootdir=~/datasets/Opencpop/segments/
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${MAIN_ROOT}/utils/link_wav.py \
--old-dump-dir=dump \
--dump-dir=dump_finetune
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
cp dump/train/feats_stats.npy dump_finetune/train/
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize; dev and test should use the train split's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/train/raw/metadata.jsonl \
--dumpdir=dump_finetune/train/norm \
--stats=dump_finetune/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/dev/raw/metadata.jsonl \
--dumpdir=dump_finetune/dev/norm \
--stats=dump_finetune/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/test/raw/metadata.jsonl \
--dumpdir=dump_finetune/test/norm \
--stats=dump_finetune/train/feats_stats.npy
fi
# create finetune env
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "create finetune env"
python3 local/prepare_env.py \
--pretrained_model_dir=exp/default/checkpoints/ \
--output_dir=exp/finetune/
fi
# finetune
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
CUDA_VISIBLE_DEVICES=${gpus} \
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump_finetune/train/norm/metadata.jsonl \
--dev-metadata=dump_finetune/dev/norm/metadata.jsonl \
--config=conf/finetune.yaml \
--output-dir=exp/finetune \
--ngpu=1
fi
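Stage 3 above deliberately reuses the train split's feats_stats.npy for dev and test, so all three splits are scaled by the same statistics. Conceptually it is a z-score normalization; the (2, n_mels) layout with the mean in row 0 and the scale in row 1 is an assumption about the stats file, not a documented format:

import numpy as np

stats = np.load("dump_finetune/train/feats_stats.npy")  # assumed shape: (2, n_mels)
mean, scale = stats[0], stats[1]

def normalize(feats):
    # Same train-split statistics for train, dev, and test, as in stage 3.
    return (feats - mean) / scale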

@ -0,0 +1 @@
../../../csmsc/voc1/local/PTQ_static.sh

@ -0,0 +1,15 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../../dygraph_to_static.py \
--type=voc \
--voc=hifigan_opencpop \
--voc_config=${config_path} \
--voc_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--voc_stat=dump/train/feats_stats.npy \
--inference_dir=exp/default/inference/
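After export, the static-graph vocoder under exp/default/inference/ can be served with Paddle's inference API. A minimal sketch; the hifigan_opencpop.pdmodel / .pdiparams file names follow the --voc flag above, and the (T, 80) mel input layout is an assumption:

import numpy as np
from paddle.inference import Config, create_predictor

config = Config("exp/default/inference/hifigan_opencpop.pdmodel",
                "exp/default/inference/hifigan_opencpop.pdiparams")
predictor = create_predictor(config)

mel = np.random.randn(100, 80).astype("float32")  # dummy mel input
handle = predictor.get_input_handle(predictor.get_input_names()[0])
handle.copy_from_cpu(mel)
predictor.run()
wav = predictor.get_output_handle(predictor.get_output_names()[0]).copy_to_cpu()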

@ -0,0 +1 @@
../../../other/tts_finetune/tts3/local/prepare_env.py

@ -0,0 +1 @@
../../voc1/local/preprocess.sh

@ -0,0 +1 @@
../../../csmsc/voc5/local/synthesize.sh

@ -0,0 +1 @@
../../../csmsc/voc1/local/train.sh

@ -0,0 +1 @@
../../csmsc/voc5/path.sh

@ -0,0 +1,42 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_2500000.pdz
# With the following options, you can choose the stage range you want to run,
# such as `./run.sh --stage 0 --stop-stage 0`.
# These options cannot be mixed with positional arguments `$1`, `$2`, ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# dygraph to static
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/dygraph_to_static.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# PTQ_static
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/PTQ_static.sh ${train_output_path} hifigan_opencpop || exit -1
fi
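Each block above runs only when its stage number falls inside [stage, stop_stage]. The gating condition, restated in Python for clarity:

def should_run(n, stage=0, stop_stage=100):
    # Mirrors `[ ${stage} -le n ] && [ ${stop_stage} -ge n ]` in the script.
    return stage <= n <= stop_stage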

@ -32,7 +32,7 @@ iPad Pro的秒控键盘这次也推出白色版本。|iPad Pro的秒控键盘这
明天有62%的概率降雨|明天有百分之六十二的概率降雨
这是固话0421-33441122|这是固话零四二一三三四四一一二二
这是手机+86 18544139121|这是手机八六一八五四四一三九一二一
小王的身高是153.5cm,梦想是打篮球!我觉得有0.1%的可能性。|小王的身高是一百五十三点五cm,梦想是打篮球!我觉得有百分之零点一的可能性。
小王的身高是153.5cm,梦想是打篮球!我觉得有0.1%的可能性。|小王的身高是一百五十三点五厘米,梦想是打篮球!我觉得有百分之零点一的可能性。
不管三七二十一|不管三七二十一
九九八十一难|九九八十一难
2018年5月23号上午10点10分|二零一八年五月二十三号上午十点十分
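Each line pairs raw text with its expected normalization (`raw|normalized`). A sketch of checking one pair against the Chinese text frontend; the module path (including the repo's own spelling "text_normlization") and the list-of-sentences return value are assumptions about the current API:

from paddlespeech.t2s.frontend.zh_normalization.text_normlization import TextNormalizer

tn = TextNormalizer()
raw, expected = "明天有62%的概率降雨", "明天有百分之六十二的概率降雨"
got = "".join(tn.normalize(raw))  # normalize() is assumed to return a list of sentences
assert got == expected, got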
