[FastSpeech2](https://arxiv.org/abs/2006.04558) is a classical acoustic model for text-to-speech synthesis that introduces controllable speech inputs: `phoneme duration`, `energy`, and `pitch`.
1. The `duration` control in `FastSpeech2` changes the speed of the audio while keeping the `pitch` unchanged. (In some speech tools, increasing the speed also raises the pitch, and vice versa.)
2. When we raise the `pitch` of an adult female voice (with a fixed scale ratio), we get a `child-style` timbre.
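
As a rough illustration of the points above, the sketch below applies one ratio to every phoneme's predicted duration and one to every phoneme's pitch. It uses plain NumPy arrays only. The helper name `apply_global_scale`, the assumption that pitch is stored as one value per phoneme, and the sample numbers are all hypothetical and are not taken from any FastSpeech2 implementation.

```python
import numpy as np

def apply_global_scale(durations, pitch, duration_scale=1.0, pitch_scale=1.0):
    """Hypothetical helper: scale all durations and all pitch values by fixed ratios."""
    durations = np.asarray(durations, dtype=np.float64)
    pitch = np.asarray(pitch, dtype=np.float64)
    # Durations are frame counts, so round them and keep at least one frame per phoneme.
    new_durations = np.maximum(np.rint(durations * duration_scale), 1).astype(np.int64)
    # Pitch is scaled independently of duration.
    new_pitch = pitch * pitch_scale
    return new_durations, new_pitch

# Illustrative values only (assumed: frames per phoneme, pitch per phoneme).
durations = [4, 6, 10]
pitch = [200.0, 220.0, 210.0]

# Longer durations give slower speech; the pitch values are left untouched.
slow_d, slow_p = apply_global_scale(durations, pitch, duration_scale=2.0)

# A fixed pitch ratio above 1 raises the pitch of the whole utterance.
child_d, child_p = apply_global_scale(durations, pitch, pitch_scale=1.5)
```

In this sketch a `duration_scale` above 1 lengthens every phoneme, so the speech becomes slower; a value below 1 speeds it up.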
The `duration` and `pitch` of different phonemes in a sentence can have different scale ratios. You can set different scale ratios to emphasize or weaken the pronunciation of some phonemes.
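
A minimal sketch of per-phoneme control, assuming the predicted durations and pitch values are available as arrays with one entry per phoneme. The helper `scale_per_phoneme` and all values are hypothetical. The final `np.repeat` step stands in for a length regulator that expands phoneme-level values to frame level; a real model may do this differently.

```python
import numpy as np

def scale_per_phoneme(durations, pitch, duration_ratios, pitch_ratios):
    """Hypothetical helper: apply a separate scale ratio to each phoneme."""
    durations = np.asarray(durations, dtype=np.float64)
    pitch = np.asarray(pitch, dtype=np.float64)
    duration_ratios = np.asarray(duration_ratios, dtype=np.float64)
    pitch_ratios = np.asarray(pitch_ratios, dtype=np.float64)
    if not (durations.shape == pitch.shape == duration_ratios.shape == pitch_ratios.shape):
        raise ValueError("all inputs must have one entry per phoneme")
    # Round to whole frames and keep at least one frame per phoneme.
    new_durations = np.maximum(np.rint(durations * duration_ratios), 1).astype(np.int64)
    new_pitch = pitch * pitch_ratios
    # Expand phoneme-level pitch to frame level by repeating each value for its duration.
    frame_pitch = np.repeat(new_pitch, new_durations)
    return new_durations, new_pitch, frame_pitch

# Illustrative values only: emphasize the second phoneme by making it longer and higher.
durations = [4, 6, 10]
pitch = [200.0, 220.0, 210.0]
duration_ratios = [1.0, 1.5, 1.0]
pitch_ratios = [1.0, 1.2, 1.0]

new_d, new_p, frame_p = scale_per_phoneme(durations, pitch, duration_ratios, pitch_ratios)
```

Ratios of 1.0 leave a phoneme unchanged, so only the phonemes you want to emphasize or weaken need a different value.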