Style FastSpeech2
Introduction
FastSpeech2 is a classical acoustic model for Text-to-Speech synthesis. It introduces controllable speech inputs: phoneme duration, energy, and pitch.
In the prediction phase, you can change these controllable variables to alter the style of the synthesized speech.
For example:
- The duration control in FastSpeech2 changes the speed of the audio while keeping the pitch unchanged. (In some speech tools, increasing the speed also raises the pitch, and vice versa.)
- When we set the pitch of a sentence to its mean value and set the tones of all phones to 1, we get a robot-style timbre.
- When we raise the pitch of an adult female voice by a fixed scale ratio, we get a child-style timbre.
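The three controls listed above can be sketched with plain NumPy arrays. This is only an illustration, not the code in style_syn.py: the helper names (scale_durations, robot_style, scale_pitch) are hypothetical, durations are assumed to be per-phoneme frame counts, and pitch is assumed to be a per-phoneme array of positive values. If the model works with normalized pitch, the scaling would need to be adapted.

```python
import numpy as np


def scale_durations(durations, ratio):
    """Scale per-phoneme frame counts (hypothetical helper).

    A ratio above 1 lengthens every phoneme (slower speech); a ratio
    below 1 shortens it. Pitch values are not touched.
    """
    scaled = np.round(np.asarray(durations, dtype=float) * ratio).astype(int)
    # Keep at least one frame per phoneme so none disappears.
    return np.maximum(scaled, 1)


def robot_style(pitch, tones):
    """Flatten pitch to the sentence mean and set every tone to 1."""
    pitch = np.asarray(pitch, dtype=float)
    flat_pitch = np.full_like(pitch, pitch.mean())
    flat_tones = np.ones_like(np.asarray(tones))
    return flat_pitch, flat_tones


def scale_pitch(pitch, ratio):
    """Multiply pitch by a fixed ratio (ratio > 1 raises it)."""
    return np.asarray(pitch, dtype=float) * ratio
```

In this sketch, the modified arrays would replace the predicted duration and pitch before the decoder runs. How that hook is exposed depends on the actual model code.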
The duration and pitch of different phonemes in a sentence can have different scale ratios. You can set different scale ratios to emphasize or weaken the pronunciation of some phonemes.
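Per-phoneme scale ratios can be illustrated as follows. This is a minimal sketch under assumptions: apply_phone_ratios is a hypothetical helper, not a function from style_syn.py, and the values are assumed to be one number per phoneme (durations or pitch).

```python
import numpy as np


def apply_phone_ratios(values, ratios, default=1.0):
    """Scale selected phonemes by their own ratio (hypothetical helper).

    values:  one value per phoneme (e.g. duration or pitch).
    ratios:  dict mapping phoneme index -> scale ratio.
    default: ratio used for phonemes not listed in `ratios`.
    """
    values = np.asarray(values, dtype=float)
    scale = np.full(values.shape, default, dtype=float)
    for index, ratio in ratios.items():
        scale[index] = ratio
    return values * scale


# Emphasize phoneme 1 (twice as long) and weaken phoneme 2 (half as long).
durations = apply_phone_ratios([5, 5, 5, 5], {1: 2.0, 2: 0.5})
```

Phonemes that are not listed keep their predicted values, so only the emphasized or weakened ones change.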
Usage
Run the following command line to get started:
./run.sh
For more details, please see style_syn.py
The audio samples are in style-control-in-fastspeech2