[FastSpeech2](https://arxiv.org/abs/2006.04558) is a classical acoustic model for Text-to-Speech synthesis, which introduces controllable speech input, including `phoneme duration`、`energy` and `pitch`.
In the prediction phase, you can change these controllable variables to get some interesting results.
For example:
1. The `duration` control in `FastSpeech2` can control the speed of audios will keep the `pitch`. (in some speech tool, increase the speed will increase the pitch, and vice versa.)
2. When we set `pitch` of one sentence to a mean value and set `tones` of phones to `1`, we will get a `robot-style` timbre.
3. When we raise the `pitch` of an adult female (with a fixed scale ratio), we will get a `child-style` timbre.
The `duration` and `pitch` of different phonemes in a sentence can have different scale ratios. You can set different scale ratios to emphasize or weaken the pronunciation of some phonemes.