@ -15,12 +15,21 @@ You can choose either the medium or the hard way to install paddlespeech.
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.
- `protocol` indicates the network protocol used by the streaming TTS service. Currently, both http and websocket are supported.
- `engine_list` indicates the speech engines that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
- This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`.
- The engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses Python for dynamic graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. Inference with online-onnx is faster.
- The streaming TTS engine supports the AM models **fastspeech2 and fastspeech2_cnndecoder** and the Voc models **hifigan and mb_melgan**.
- In streaming am inference, one chunk of data is inferred at a time to achieve a streaming effect. `am_block` indicates the number of valid frames in a chunk, and `am_pad` indicates the number of frames added before and after `am_block` in a chunk. `am_pad` exists to eliminate the errors introduced by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
- fastspeech2 does not support streaming am inference, so `am_pad` and `am_block` have no effect on it.
- fastspeech2_cnndecoder supports streaming inference. When `am_pad=12`, the audio synthesized by streaming inference is consistent with the audio synthesized by non-streaming inference.
- In streaming voc inference, one chunk of data is inferred at a time to achieve a streaming effect. `voc_block` indicates the number of valid frames in a chunk, and `voc_pad` indicates the number of frames added before and after `voc_block` in a chunk. `voc_pad` exists to eliminate the errors introduced by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
- Both hifigan and mb_melgan support streaming voc inference.
- When the voc model is mb_melgan and `voc_pad=14`, the audio synthesized by streaming inference is consistent with the audio synthesized by non-streaming inference. `voc_pad` can be set as low as 7 and the synthesized audio still sounds normal; below 7, the synthesized audio sounds abnormal.
- When the voc model is hifigan and `voc_pad=20`, the audio synthesized by streaming inference is consistent with the audio synthesized by non-streaming inference; with `voc_pad=14`, the synthesized audio still sounds normal.
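
The block/pad scheme described above can be illustrated with a minimal sketch. This is not the PaddleSpeech implementation: the function names `split_into_chunks` and `stream_infer` and the `infer` callback are hypothetical, and the model is replaced by an identity function so that only the chunking logic is shown. It assumes that each chunk holds up to `block` valid frames plus up to `pad` context frames on each side, and that the padded part of the output is discarded after inference.

```python
import math


def split_into_chunks(num_frames, block, pad):
    """Return (start, end, valid_start, valid_end) frame indices per chunk.

    [valid_start, valid_end) is the block of frames the chunk is responsible
    for; [start, end) additionally includes up to `pad` context frames on
    each side, clipped to the sequence boundaries.
    """
    if block <= 0 or pad < 0:
        raise ValueError("block must be positive and pad non-negative")
    chunks = []
    for i in range(math.ceil(num_frames / block)):
        valid_start = i * block
        valid_end = min(valid_start + block, num_frames)
        start = max(0, valid_start - pad)
        end = min(num_frames, valid_end + pad)
        chunks.append((start, end, valid_start, valid_end))
    return chunks


def stream_infer(frames, block, pad, infer=lambda chunk: chunk):
    """Run `infer` chunk by chunk and keep only the valid part of each result.

    `infer` is a stand-in for a model call (hypothetical); here it must
    return one output item per input frame.
    """
    output = []
    for start, end, valid_start, valid_end in split_into_chunks(len(frames), block, pad):
        result = infer(frames[start:end])
        # Drop the padded context so consecutive chunks do not overlap.
        output.extend(result[valid_start - start:valid_end - start])
    return output


if __name__ == "__main__":
    # 100 frames with am_block=42 and am_pad=12, the values used in the config.
    for chunk in split_into_chunks(100, block=42, pad=12):
        print(chunk)
```

With a real vocoder, each input frame would presumably map to several output samples, so the trimming indices would need to be scaled by the upsampling factor; the sketch leaves that out to keep the frame bookkeeping visible.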
# Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
voc: 'mb_melgan_csmsc'
voc_config:
voc_ckpt:
@ -39,8 +41,13 @@ tts_online:
# others
lang: 'zh'
device: 'cpu'  # set 'gpu:id' or 'cpu'
# am_block and am_pad apply only to streaming am inference with the fastspeech2_cnndecoder_onnx model;
# when am_pad is set to 12, streaming synthesized audio is the same as non-streaming synthesized audio
am_block: 42
am_pad: 12
# voc_block and voc_pad control streaming voc inference.
# When the voc model is mb_melgan_csmsc and voc_pad is set to 14, streaming synthesized audio is the same as non-streaming synthesized audio; voc_pad can be set as low as 7 and the streaming synthesized audio still sounds normal
# When the voc model is hifigan_csmsc and voc_pad is set to 20, streaming synthesized audio is the same as non-streaming synthesized audio; with voc_pad set to 14, the streaming synthesized audio still sounds normal
# Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
voc: 'hifigan_csmsc_onnx'
voc_ckpt:
voc_sample_rate: 24000
@ -80,9 +89,15 @@ tts_online-onnx:
# others
lang: 'zh'
# am_block and am_pad apply only to streaming am inference with the fastspeech2_cnndecoder_onnx model;
# when am_pad is set to 12, streaming synthesized audio is the same as non-streaming synthesized audio
am_block: 42
am_pad: 12
# voc_block and voc_pad control streaming voc inference.
# When the voc model is mb_melgan_csmsc_onnx and voc_pad is set to 14, streaming synthesized audio is the same as non-streaming synthesized audio; voc_pad can be set as low as 7 and the streaming synthesized audio still sounds normal
# When the voc model is hifigan_csmsc_onnx and voc_pad is set to 20, streaming synthesized audio is the same as non-streaming synthesized audio; with voc_pad set to 14, the streaming synthesized audio still sounds normal
voc_block: 14
voc_pad: 14
# voc_upsample should be the same as n_shift in the voc config.