Merge branch 'develop' into u2

commit 5681736acc

@ -9,29 +9,24 @@
*PaddleASR* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition through an easy-to-use, efficient, smaller, and scalable implementation that covers training, inference, testing, and deployment.
## Models
## Features
* [Baidu's DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Transformer](https://arxiv.org/abs/1706.03762)
* [Conformer](https://arxiv.org/abs/2005.08100)
* [U2](https://arxiv.org/pdf/2012.05481.pdf)
See [feature list](doc/src/feature_list.md) for more information.
## Setup
* python>=3.7
* paddlepaddle>=2.1.0
Please see [install](doc/install.md).
Please see [install](doc/src/install.md).
## Getting Started
Please see [Getting Started](doc/src/getting_started.md) and [tiny egs](examples/tiny/README.md).
Please see [Getting Started](doc/src/getting_started.md) and [tiny egs](examples/tiny/s0/README.md).
## More Information
* [Install](doc/src/install.md)
* [Getting Started](doc/src/getting_started.md)
* [Data Preparation](doc/src/data_preparation.md)
* [Data Augmentation](doc/src/augmentation.md)
* [Ngram LM](doc/src/ngram_lm.md)
@ -48,7 +43,7 @@ You are welcome to submit questions in [Github Discussions](https://github.com/P
## License
DeepSpeech is provided under the [Apache-2.0 License](./LICENSE).
DeepASR is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement

@ -9,12 +9,9 @@
*PaddleASR*是一个采用[PaddlePaddle](https://github.com/PaddlePaddle/Paddle)平台的端到端自动语音识别ASR引擎的开源项目
我们的愿景是为语音识别在工业应用和学术研究上,提供易于使用、高效、小型化和可扩展的工具,包括训练、推理以及部署。
## 模型
## 特性
* [Baidu's DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Transformer](https://arxiv.org/abs/1706.03762)
* [Conformer](https://arxiv.org/abs/2005.08100)
* [U2](https://arxiv.org/pdf/2012.05481.pdf)
参看 [特性列表](doc/src/feature_list.md)。
## 安装
@ -22,16 +19,14 @@
* python>=3.7
* paddlepaddle>=2.1.0
参看 [安装](doc/install.md)。
参看 [安装](doc/src/install.md)。
## 开始
请查看 [Getting Started](doc/src/getting_started.md) 和 [tiny egs](examples/tiny/README.md)。
请查看 [开始](doc/src/getting_started.md) 和 [tiny egs](examples/tiny/s0/README.md)。
## 更多信息
* [安装](doc/src/install.md)
* [开始](doc/src/getting_started.md)
* [数据处理](doc/src/data_preparation.md)
* [数据增强](doc/src/augmentation.md)
* [语言模型](doc/src/ngram_lm.md)
@ -46,7 +41,7 @@
## License
DeepSpeech遵循[Apache-2.0开源协议](./LICENSE)。
DeepASR 遵循[Apache-2.0开源协议](./LICENSE)。
## 感谢

@ -43,13 +43,11 @@ class DeepSpeech2Trainer(Trainer):
def train_batch(self, batch_index, batch_data, msg):
start = time.time()
loss = self.model(*batch_data)
loss.backward()
layer_tools.print_grads(self.model, print_func=None)
self.optimizer.step()
self.optimizer.clear_grad()
iteration_time = time.time() - start
losses_np = {
@ -274,8 +272,8 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
infer_model,
input_spec=[
paddle.static.InputSpec(
shape=[None, feat_dim, None],
dtype='float32'), # audio, [B,D,T]
shape=[None, None, feat_dim],
dtype='float32'), # audio, [B,T,D]
paddle.static.InputSpec(shape=[None],
dtype='int64'), # audio_length, [B]
])

@ -0,0 +1,48 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Alignment for U2 model."""
from deepspeech.exps.u2.config import get_cfg_defaults
from deepspeech.exps.u2.model import U2Tester as Tester
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):
exp = Tester(config, args)
exp.setup()
exp.run_align()
def main(config, args):
main_sp(config, args)
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
if args.dump_config:
with open(args.dump_config, 'w') as f:
print(config, file=f)
main(config, args)

@ -351,7 +351,9 @@ class AudioSegment(object):
tfm.set_globals(multithread=False)
tfm.speed(speed_rate)
self._samples = tfm.build_array(
input_array=self._samples, sample_rate_in=self._sample_rate).copy()
input_array=self._samples,
sample_rate_in=self._sample_rate).squeeze(-1).astype(
np.float32).copy()
def normalize(self, target_db=-20, max_gain_db=300.0):
"""Normalize audio to be of the desired RMS value in decibels.

@ -179,8 +179,8 @@ class FeatureNormalizer(object):
wav_number += batch_size
if wav_number % 1000 == 0:
logger.info('process {} wavs,{} frames'.format(wav_number,
all_number))
logger.info(
f'process {wav_number} wavs,{all_number} frames.')
self.cmvn_info = {
'mean_stat': list(all_mean_stat.tolist()),

@ -15,7 +15,7 @@ from paddle import nn
from paddle.nn import functional as F
from deepspeech.modules.activation import brelu
from deepspeech.modules.mask import sequence_mask
from deepspeech.modules.mask import make_non_pad_mask
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
@ -111,8 +111,10 @@ class ConvBn(nn.Layer):
) // self.stride[1] + 1
# reset padding part to 0
masks = sequence_mask(x_len) #[B, T]
masks = make_non_pad_mask(x_len) #[B, T]
masks = masks.unsqueeze(1).unsqueeze(1) # [B, 1, 1, T]
# TODO(Hui Zhang): not support bool multiply
masks = masks.type_as(x)
x = x.multiply(masks)
return x, x_len

@ -18,40 +18,12 @@ from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = [
'sequence_mask', "make_pad_mask", "make_non_pad_mask", "subsequent_mask",
"make_pad_mask", "make_non_pad_mask", "subsequent_mask",
"subsequent_chunk_mask", "add_optional_chunk_mask", "mask_finished_scores",
"mask_finished_preds"
]
def sequence_mask(x_len, max_len=None, dtype='float32'):
"""batch sequence mask.
Args:
x_len ([paddle.Tensor]): xs length, [B]
max_len ([type], optional): max sequence length. Defaults to None.
dtype (str, optional): mask data type. Defaults to 'float32'.
Returns:
paddle.Tensor: [B, Tmax]
Examples:
>>> sequence_mask([2, 4])
[[1., 1., 0., 0.],
[1., 1., 1., 1.]]
"""
# (TODO: Hui Zhang): jit does not support Tensor.dim() and Tensor.ndim
# assert x_len.dim() == 1, (x_len.dim(), x_len)
max_len = max_len or x_len.max()
x_len = paddle.unsqueeze(x_len, -1)
row_vector = paddle.arange(max_len)
# TODO(Hui Zhang): fix this bug
#mask = row_vector < x_len
mask = row_vector > x_len  # known bug: broadcasting goes wrong here
mask = paddle.cast(mask, dtype)
return mask
def make_pad_mask(lengths: paddle.Tensor) -> paddle.Tensor:
"""Make mask tensor containing indices of padded part.
See description of make_non_pad_mask.
@ -66,7 +38,8 @@ def make_pad_mask(lengths: paddle.Tensor) -> paddle.Tensor:
[0, 0, 0, 1, 1],
[0, 0, 1, 1, 1]]
"""
assert lengths.dim() == 1
# (TODO: Hui Zhang): jit does not support Tensor.dim() and Tensor.ndim
# assert lengths.dim() == 1
batch_size = int(lengths.shape[0])
max_len = int(lengths.max())
seq_range = paddle.arange(0, max_len, dtype=paddle.int64)

@ -19,7 +19,7 @@ from paddle.nn import functional as F
from paddle.nn import initializer as I
from deepspeech.modules.activation import brelu
from deepspeech.modules.mask import sequence_mask
from deepspeech.modules.mask import make_non_pad_mask
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
@ -306,7 +306,9 @@ class RNNStack(nn.Layer):
"""
for i, rnn in enumerate(self.rnn_stacks):
x, x_len = rnn(x, x_len)
masks = sequence_mask(x_len) #[B, T]
masks = make_non_pad_mask(x_len) #[B, T]
masks = masks.unsqueeze(-1) # [B, T, 1]
# TODO(Hui Zhang): not support bool multiply
masks = masks.type_as(x)
x = x.multiply(masks)
return x, x_len

@ -31,7 +31,7 @@ class ClipGradByGlobalNormWithLog(paddle.nn.ClipGradByGlobalNorm):
def _dygraph_clip(self, params_grads):
params_and_grads = []
sum_square_list = []
for p, g in params_grads:
for i, (p, g) in enumerate(params_grads):
if g is None:
continue
if getattr(p, 'need_clip', True) is False:
@ -45,7 +45,9 @@ class ClipGradByGlobalNormWithLog(paddle.nn.ClipGradByGlobalNorm):
sum_square_list.append(sum_square)
# debug log
# logger.debug(f"Grad Before Clip: {p.name}: {float(sum_square.sqrt()) }")
if i < 10:
logger.debug(
f"Grad Before Clip: {p.name}: {float(sum_square.sqrt()) }")
# all parameters have been filtered out
if len(sum_square_list) == 0:
@ -62,7 +64,7 @@ class ClipGradByGlobalNormWithLog(paddle.nn.ClipGradByGlobalNorm):
clip_var = layers.elementwise_div(
x=max_global_norm,
y=layers.elementwise_max(x=global_norm_var, y=max_global_norm))
for p, g in params_grads:
for i, (p, g) in enumerate(params_grads):
if g is None:
continue
if getattr(p, 'need_clip', True) is False:
@ -72,8 +74,9 @@ class ClipGradByGlobalNormWithLog(paddle.nn.ClipGradByGlobalNorm):
params_and_grads.append((p, new_grad))
# debug log
# logger.debug(
# f"Grad After Clip: {p.name}: {float(merge_grad.square().sum().sqrt())}"
# )
if i < 10:
logger.debug(
f"Grad After Clip: {p.name}: {float(new_grad.square().sum().sqrt())}"
)
return params_and_grads

@ -226,6 +226,7 @@ class Trainer():
'lr': self.lr_scheduler()}, self.epoch)
self.save(tag=self.epoch, infos={'val_loss': cv_loss})
# step lr every epoch
self.lr_scheduler.step()
self.new_epoch()
@ -283,7 +284,6 @@ class Trainer():
"""
# visualizer
visualizer = SummaryWriter(logdir=str(self.output_dir))
self.visualizer = visualizer
@mp_tools.rank_zero_only
@ -301,7 +301,6 @@ class Trainer():
"""
raise NotImplementedError("train_batch should be implemented.")
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
"""The validation. A subclass should implement this method.

Binary file not shown (image added; 46 KiB).

@ -0,0 +1,20 @@
# Alignment
我们首先从建模的角度理解一下对齐。语音识别任务需要对输入音频序列 X = [x1, x2, x3, ..., xt, ..., xT](通常是 fbank 或 mfcc 等音频特征)和输出的标注文本序列 Y = [y1, y2, y3, ..., yu, ..., yU] 之间的关系进行建模,其中 X 的长度一般大于 Y 的长度。如果能够知道 yu 和 xt 的对应关系,就可以将这类任务变成语音帧级别上的分类任务:对每个时刻的 xt 进行分类得到对应的 yu。
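As a toy illustration of this frame-classification view (a minimal sketch with made-up posteriors, not this repo's alignment code), the snippet below picks the best label for every frame and then collapses blanks/repeats, which yields both the label sequence Y and the frame at which each token starts:

```python
import numpy as np

# Toy per-frame posteriors for 6 frames over the vocabulary {blank=0, A=1, B=2}.
# The numbers are made up purely for illustration.
posteriors = np.array([
    [0.7, 0.2, 0.1],   # frame 0 -> blank
    [0.1, 0.8, 0.1],   # frame 1 -> A
    [0.2, 0.7, 0.1],   # frame 2 -> A
    [0.8, 0.1, 0.1],   # frame 3 -> blank
    [0.1, 0.1, 0.8],   # frame 4 -> B
    [0.6, 0.2, 0.2],   # frame 5 -> blank
])

frame_labels = posteriors.argmax(axis=-1)   # frame-level classification
print(frame_labels.tolist())                # [0, 1, 1, 0, 2, 0]

# Collapse repeats and drop blanks to recover Y, remembering where each token starts.
alignment, prev = [], 0
for t, k in enumerate(frame_labels):
    if k != 0 and k != prev:
        alignment.append((int(k), t))       # token k starts at frame t
    prev = k
print(alignment)                            # [(1, 1), (2, 4)]: A starts at frame 1, B at frame 4
```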
## MFA
## CTC Alignment
## Reference
* [ctc alignment](https://mp.weixin.qq.com/s/4aGehNN7PpIvCh03qTT5oA)
* [时间戳和N-Best](https://mp.weixin.qq.com/s?__biz=MzU2NjUwMTgxOQ==&mid=2247483956&idx=1&sn=80ce595238d84155d50f08c0d52267d3&chksm=fcaacae0cbdd43f62b1da60c8e8671a9e0bb2aeee94f58751839b03a1c45b9a3889b96705080&scene=21#wechat_redirect)

@ -1,4 +1,4 @@
# ASR PostProcess
# ASR Text Backend
1. [Text Segmentation](text_front_end#text segmentation)
2. Text Corrector

@ -4,7 +4,7 @@
We compare the training time with 1, 2, 4, 8 Tesla V100 GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). And it shows that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
<img src="../images/multi_gpu_speedup.png" width=450><br/>
<img src="../images/multi_gpu_speedup.png" width=450>
| # of GPU | Acceleration Rate |
| -------- | --------------: |

@ -13,8 +13,6 @@
There are a total of 410 common pinyin syllables.
* [Rare syllable](https://resources.allsetlearning.com/chinese/pronunciation/Rare_syllable)
* [Chinese Pronunciation: The Complete Guide for Beginner](https://www.digmandarin.com/chinese-pronunciation-guide.html)
@ -51,3 +49,22 @@
* [Bopomofo](https://en.wikipedia.org/wiki/Bopomofo)
* [Zhuyin table](https://en.wikipedia.org/wiki/Zhuyin_table)
## Tone sandhi
* https://zh.wikipedia.org/wiki/%E8%AE%8A%E8%AA%BF
* https://github.com/mozillazg/python-pinyin/issues/133
pypinyin关于变调错误的评估
## tools
* https://github.com/KuangDD/phkit
* https://github.com/mozillazg/python-pinyin
* https://github.com/Kyubyong/g2pC
* https://github.com/kakaobrain/g2pM

@ -13,3 +13,9 @@
* [Tatoeba](https://tatoeba.org/cmn)
**Tatoeba is a collection of sentences and translations.** It's collaborative, open, free and even addictive. An open data initiative aimed at translation and speech recognition.
### ASR Noise
* [asr-noises](https://github.com/speechio/asr-noises)

@ -0,0 +1,5 @@
# Decoding
## Reference
* [时间戳和N-Best](https://mp.weixin.qq.com/s?__biz=MzU2NjUwMTgxOQ==&mid=2247483956&idx=1&sn=80ce595238d84155d50f08c0d52267d3&chksm=fcaacae0cbdd43f62b1da60c8e8671a9e0bb2aeee94f58751839b03a1c45b9a3889b96705080&scene=21#wechat_redirect)

@ -0,0 +1,61 @@
# Features
### Speech Recognition
* Offline
* [Baidu's DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Transformer](https://arxiv.org/abs/1706.03762)
* [Conformer](https://arxiv.org/abs/2005.08100)
* Online
* [U2](https://arxiv.org/pdf/2012.05481.pdf)
### Language Model
* Ngram
### Decoder
* ctc greedy
* ctc prefix beam search
* greedy
* beam search
* attention rescore
### Speech Frontend
* Audio
* Auto Gain
* Feature
* kaldi fbank
* kaldi mfcc
* linear
* delta delta
### Speech Augmentation
* Audio
- Volume Perturbation
- Speed Perturbation
- Shifting Perturbation
- Online Bayesian normalization
- Noise Perturbation
- Impulse Response
* Spectrum
- SpecAugment
- Adaptive SpecAugment
### Tokenizer
* Chinese/English Character
* English Word
* Sentence Piece
### Word Segmentation
* [mmseg](http://technology.chtsai.org/mmseg/)
### Grapheme To Phoneme
* syllable
* phoneme

@ -0,0 +1,258 @@
# Praat and TextGrid
* [**Praat: doing phonetics by computer**](https://www.fon.hum.uva.nl/praat/)
* [TextGrid](https://github.com/kylebgorman/textgrid)
## Praat
**Praat语音学软件**,原名**Praat: doing phonetics by computer**,通常简称**Praat**,是一款[跨平台](https://zh.wikipedia.org/wiki/跨平台)的多功能[语音学](https://zh.wikipedia.org/wiki/语音学)专业[软件](https://zh.wikipedia.org/wiki/软件),主要用于对[数字化](https://zh.wikipedia.org/wiki/数字化)的[语音](https://zh.wikipedia.org/wiki/语音)[信号](https://zh.wikipedia.org/wiki/信号)进行[分析](https://zh.wikipedia.org/w/index.php?title=语音分析&action=edit&redlink=1)、标注、[处理](https://zh.wikipedia.org/wiki/数字信号处理)及[合成](https://zh.wikipedia.org/wiki/语音合成)等实验,同时生成各种[语图](https://zh.wikipedia.org/w/index.php?title=语图&action=edit&redlink=1)和文字报表。
<img src="../images/Praat-output-showing-forced-alignment-for-Mae-hen-wlad-fy-nhadau-the-first-line-of-the.png" width=650>
## TextGrid
### TextGrid文件结构
```text
第一行是固定的:File type = "ooTextFile"
第二行也是固定的:Object class = "TextGrid"
空一行
xmin = xxxx.xxxx  # 表示开始时间
xmax = xxxx.xxxx  # 表示结束时间
tiers? <exists>  # 这一行固定
size = 4  # 表示这个文件有几个item, item也叫tiers, 可以翻译为'层', 这个值是几,就表示有几个item
item []:
    item [1]:
        class = "IntervalTier"
        name = "phone"
        xmin = 1358.8925
        xmax = 1422.5525
        intervals: size = 104
        intervals [1]:
            xmin = 1358.8925
            xmax = 1361.8925
            text = "sil"
        intervals [2]:
            xmin = 1361.8925
            xmax = 1362.0125
            text = "R"
        intervals [3]:
            ...
        intervals [104]:
            xmin = 1422.2325
            xmax = 1422.5525
            text = "sil"
    item [2]:
        class = "IntervalTier"
        name = "word"
        xmin = 1358.8925
        xmax = 1422.5525
        intervals: size = 3
        intervals [1]:
            xmin = 1358.8925
            xmax = 1361.8925
            text = "sp"
```
textgrid 文件中的 size 的值是几,就表示有几个 item,每个 item 下面包含 class, name, xmin, xmax, intervals 的键值对。item 中的 intervals: size 的值是几,就表示这个 item 中有几个 intervals,每个 intervals 有 xmin, xmax, text 三个键值参数。所有 item 中的 xmax - xmin 的值是一样的。
### 安装
```python
pip3 install textgrid
```
### 使用
1. 读一个textgrid文件
```python
import textgrid
tg = textgrid.TextGrid()
tg.read('file.TextGrid') # 'file.TextGrid' 是文件名
```
tg.tiers属性:
会把文件中的所有item打印出来, print(tg.tiers) 的结果如下:
```text
[IntervalTier(
phone, [
Interval(1358.89250, 1361.89250, sil),
Interval(1361.89250, 1362.01250, R),
Interval(1362.01250, 1362.13250, AY1),
Interval(1362.13250, 1362.16250, T),
...
]
)
]
```
此外, tg.tiers[0] 表示第一个 IntervalTier, 支持继续用中括号取序列, '.'来取属性.
比如:
```text
tg.tiers[0][0].mark --> 'sil'
tg.tiers[0].name --> 'phone'
tg.tiers[0][0].minTime --> 1358.8925
tg.tiers[0].intervals --> [Interval(1358.89250, 1361.89250, sil), ..., Interval(1422.23250, 1422.55250, sil)]
tg.tiers[0].maxTime --> 1422.55250
```
TextGrid 模块中包含四种对象
```
PointTier 可以理解为标记(点)的集合
IntervalTier 可以理解为时长(区间)的集合
Point 可以理解为标记
Interval 可以理解为时长
```
2. textgrid库中的对象
**IntervalTier** 对象:
方法
```
add(minTime, maxTime, mark): 添加一个标记,需要同时传入起止时间, 和mark的名字.
addInterval(interval): 添加一个Interval对象, 该Interval对象中已经封装了起止时间.
remove(minTime, maxTime, mark): 删除一个Interval
removeInterval(interval): 删除一个Interval
indexContaining(time): 传入时间或Point对象, 返回包含该时间的Interval对象的下标
例如:
print(tg[0].indexContaining(1362)) --> 1
表示tg[0] 中包含1362时间点的是 下标为1的 Interval 对象
intervalContaining(): 传入时间或Point对象, 返回包含该时间的Interval对象
例如
print(tg[0].intervalContaining(1362)) --> Interval(1361.89250, 1362.01250, R)
read(f): f是文件对象, 读一个TextGrid文件
write(f): f是文件对象, 写一个TextGrid文件
fromFile(f_path): f_path是文件路径, 从一个文件读
bounds(): 返回一个元组, (minTime, maxTime)
```
属性
```
intervals --> 返回所有的 interval 的列表
maxTime --> 返回 number(decimal.Decimal)类型, 表示结束时间
minTime --> 返回 number(decimal.Decimal)类型, 表示开始时间
name --> 返回字符串
strict -- > 返回bool值, 表示是否严格TextGrid格式
```
**PointTier** 对象:
方法
```
add(minTime, maxTime, mark): 添加一个标记,需要同时传入起止时间, 和mark的名字.
addPoint(point): 添加一个Point对象, 该Point对象中已经封装了起止时间.
remove(time, mark): 删除一个 point, 传入时间和mark
removePoint(point): 删除一个 point, 传入point对象
read(f): 读, f是文件对象
write(f): 写, f是文件对象
fromFile(f_path): f_path是文件路径, 从一个文件读
bounds(): 返回一个元组, (minTime, maxTime)
```
属性
```
points 返回所有的 point 的列表
maxTime 和IntervalTier一样, 返回结束时间
minTime 和IntervalTier一样, 返回开始时间
name 返回name
```
**Point** 对象:
支持比较大小, 支持加减运算
属性:
```
mark:
time:
```
**Interval** 对象:
支持比较大小, 支持加减运算
支持 in, not in 的运算
方法:
```
duration(): 返回number 类型, 表示这个Interval的持续时间
bounds(): --> 返回元组, (minTime, maxTime)
overlaps(Interval): --> 返回bool值, 判断本Interval的时间和传入的的Interval的时间是否重叠, 是返回True
```
属性:
```
mark
maxTime
minTime
strick: --> 返回bool值, 判断格式是否严格的TextGrid格式
```
**TextGrid** 对象:
支持列表的取值,复制, 迭代, 求长度, append, extend, pop方法
方法:
```
getFirst(tierName) 返回第一个名字为tierName的tier
getList(tierName) 返回名字为tierName的tier的列表
getNames() 返回所有tier的名字的列表
append(tier) 添加一个tier作为其中的元素
extend(tiers) 添加多个tier作为其中的元素
pop(tier) 删除一个tier
read(f) f是文件对象
write(f) f是文件对象
fromFile(f_path) f_path是文件路径
```
属性:
```
maxTime
minTime
name
strict
tiers 返回所有tiers的列表
```
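As a quick illustration of the methods and attributes listed above, here is a minimal sketch that builds a one-tier TextGrid in memory and writes it to a file. It relies only on the API described in this document (`add`, `append`, `write`); treat the constructor details as assumptions about the `textgrid` package:

```python
import textgrid

# Build a TextGrid with a single IntervalTier named "phone".
tg = textgrid.TextGrid()
tier = textgrid.IntervalTier(name='phone')
tier.add(0.0, 1.0, 'sil')   # add(minTime, maxTime, mark)
tier.add(1.0, 2.0, 'a')
tg.append(tier)             # append(tier): attach the tier to the TextGrid

# write(f): f is a file object
with open('demo.TextGrid', 'w') as f:
    tg.write(f)
```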
**MLF** 对象
MLF('xxx.mlf')
'xxx.mlf'为mlf格式的文件,
读取hvite-o sm生成的htk.mlf文件并将其转换为 TextGrid的列表
方法:
```
read(f) f是文件对象
write(prefix='') prefix是写出路径的前缀,可选
```
属性:
```
grids: --> 返回读取的grids的列表
```
## Reference
* https://zh.wikipedia.org/wiki/Praat%E8%AF%AD%E9%9F%B3%E5%AD%A6%E8%BD%AF%E4%BB%B6
* https://blog.csdn.net/duxin_csdn/article/details/88966295

@ -0,0 +1,3 @@
# Useful Tools
* [正则可视化和常用正则表达式](https://wangwl.net/static/projects/visualRegex/#)

@ -13,25 +13,74 @@ There are various libraries including some of the most popular ones like NLTK, S
## Text Normalization(文本正则)
The **basic preprocessing steps** in English NLP include data cleaning, stemming/lemmatization, tokenization, and stop-word removal. **Not all of these steps are necessary for Chinese text data!**
### Lexicon Normalization
There's a concept similar to stems in this language, and they're called radicals. **Radicals are basically the building blocks of Chinese characters.** All Chinese characters are made up of a finite number of components which are put together in different orders and combinations. Radicals are usually the leftmost part of the character. There are around 200 radicals in Chinese, and they are used to index and categorize characters.
Therefore, procedures like stemming and lemmatization are not useful for Chinese text data because separating the radicals would **change the word's meaning entirely**.
### Tokenization
**Tokenizing breaks up text data into shorter pre-set strings**, which help build context and meaning for the machine learning model.
These “tags” label the part of speech. There are 24 part-of-speech tags and 4 proper-name category labels in the `jieba` package's existing dictionary.
<img src="../images/jieba_tags.png" width=650>
### Stop Words
In NLP, **stop words are “meaningless” words** that make the data too noisy or ambiguous.
Instead of manually removing them, you could import the `stopwordsiso` package for a full list of Chinese stop words. More information can be found [here](https://pypi.org/project/stopwordsiso/). With it, we can easily write code that filters out any stop words in large text data.
```python
!pip install stopwordsiso
import stopwordsiso
from stopwordsiso import stopwords
stopwords(["zh"]) # Chinese
```
文本正则化主要是将非标准词(NSW)进行转化,比如:
数字、电话号码: 10086 -> 一千零八十六/幺零零八六
时间,比分: 23:20 -> 二十三点二十分/二十三比二十
分数、小数、百分比: 3/4 -> 四分之三3.24 -> 三点一四, 15% -> 百分之十五
符号、单位: ¥ -> 元, kg -> 千克
网址、文件后缀: www. -> 三W点
1. 数字、电话号码: 10086 -> 一千零八十六/幺零零八六
2. 时间,比分: 23:20 -> 二十三点二十分/二十三比二十
3. 分数、小数、百分比: 3/4 -> 四分之三, 3.24 -> 三点二四, 15% -> 百分之十五
4. 符号、单位: ¥ -> 元, kg -> 千克
5. 网址、文件后缀: www. -> 三W点
其他转换:
1. 简体和繁体转换:中国语言 -> 中國語言
2. 半角和全角转换:, -> ,
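These NSW rules are typically implemented as a cascade of regular-expression substitutions. The sketch below is a minimal, hypothetical illustration covering only small percentages and a couple of symbol/unit replacements; production systems such as speechio/chinese_text_normalization handle far more cases (dates, phone numbers, decimals, and so on):

```python
import re

DIGITS = '零一二三四五六七八九'

def read_number_0_99(num: str) -> str:
    """Value-style reading for 0-99, e.g. '15' -> '十五', '75' -> '七十五'."""
    n = int(num)
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    return ('' if tens == 1 else DIGITS[tens]) + '十' + (DIGITS[ones] if ones else '')

def normalize(text: str) -> str:
    # 15% -> 百分之十五
    text = re.sub(r'(\d{1,2})%', lambda m: '百分之' + read_number_0_99(m.group(1)), text)
    # symbol / unit substitutions: kg -> 千克, ¥ -> 元
    return text.replace('kg', '千克').replace('¥', '元')

print(normalize('价格上涨15%,重5kg'))  # -> 价格上涨百分之十五,重5千克
```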
### tools
* https://github.com/google/re2
* https://github.com/speechio/chinese_text_normalization
* [vinorm](https://github.com/NoahDrisort/vinorm) [cpp_verion](https://github.com/NoahDrisort/vinorm_cpp_version)
Python package for text normalization, used for the frontend of text-to-speech research.
* https://github.com/candlewill/CNTN
This is a ChiNese Text Normalization (CNTN) tool for Text-to-speech system, which is based on [sparrowhawk](https://github.com/google/sparrowhawk).
* [Simplified and Traditional Chinese Characters converter](https://github.com/berniey/hanziconv)
* [Halfwidth and Fullwidth](https://zh.wikipedia.org/wiki/%E5%85%A8%E5%BD%A2%E5%92%8C%E5%8D%8A%E5%BD%A2)
## Word Segmentation(分词)
分词之所以重要可以通过这个例子来说明:
广州市长隆马戏欢迎你 -> 广州市 长隆 马戏 欢迎你
如果没有分词错误会导致句意完全不正确: 
广州市长隆马戏欢迎你 -> 广州市 长隆 马戏 欢迎你
如果没有分词错误会导致句意完全不正确:
广州 市长 隆马戏 欢迎你
分词常用方法分为最大前向匹配(基于字典)和基于 CRF 的分词方法。用 CRF 的方法相当于是把这个任务转换成了序列标注。相比于基于字典的方法,好处是对于歧义或者未登录词有较强的识别能力;缺点是不能快速 fix bug,并且性能略低于词典。
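The dictionary-based route is easy to try with `jieba`. Whether "长隆" survives as one token depends on the dictionary your `jieba` version ships, so the outputs in the comments are only the hoped-for results; `add_word` shows the usual fix when a name is missing:

```python
import jieba

print(jieba.lcut("广州市长隆马戏欢迎你"))
# desired:            ['广州市', '长隆', '马戏', '欢迎', '你']
# a wrong split reads: ['广州', '市长', '隆马戏', '欢迎', '你']

# If the entity is missing from the dictionary, register it as a user word.
jieba.add_word("长隆")
print(jieba.lcut("广州市长隆马戏欢迎你"))
```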
@ -42,6 +91,7 @@ There are various libraries including some of the most popular ones like NLTK, S
* https://github.com/thunlp/THULAC-Python
* https://github.com/fxsjy/jieba
* CRF++
* https://github.com/isnowfy/snownlp
### MMSEG
* [MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm](http://technology.chtsai.org/mmseg/)
@ -63,11 +113,14 @@ There are various libraries including some of the most popular ones like NLTK, S
## Part of Speech(词性预测)
词性解释
```
n/名词 np/人名 ns/地名 ni/机构名 nz/其它专名
m/数词 q/量词 mq/数量词 t/时间词 f/方位词 s/处所词
v/动词 a/形容词 d/副词 h/前接成分 k/后接成分
i/习语 j/简称 r/代词 c/连词 p/介词 u/助词 y/语气助词
e/叹词 o/拟声词 g/语素 w/标点 x/其它
```
@ -77,7 +130,10 @@ e/叹词 o/拟声词 g/语素 w/标点 x/其它
传统方法是使用字典,但是对于未登录词就很难解决。基于模型的方法是使用 [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus)。 论文可以参考 - WFST-based Grapheme-to-Phoneme Conversion: Open Source Tools for Alignment, Model-Building and Decoding
当然这个问题也可以看做是序列标注用CRF或者基于神经网络的模型都可以做。 基于神经网络工具: [g2pM](https://github.com/kakaobrain/g2pM)。
当然这个问题也可以看做是序列标注,用 CRF 或者基于神经网络的模型都可以做。基于神经网络的工具:
* https://github.com/kakaobrain/g2pM
* https://github.com/Kyubyong/g2p
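A simple dictionary-based baseline is also available through `pypinyin`, using the same `lazy_pinyin(..., style=Style.TONE3, neutral_tone_with_five=True)` call that the Baker example scripts elsewhere in this change use. Note that it does not model tone sandhi, which is exactly the weakness discussed above; the commented output is the expected result and may vary with the pypinyin version:

```python
from pypinyin import lazy_pinyin
from pypinyin import Style

print(lazy_pinyin('你好吗', style=Style.TONE3, neutral_tone_with_five=True))
# expected: ['ni3', 'hao3', 'ma5']   (the spoken 2-3 sandhi of 你好 is not applied)
```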
@ -86,23 +142,25 @@ e/叹词 o/拟声词 g/语素 w/标点 x/其它
ToBI(an abbreviation of tones and break indices) is a set of conventions for transcribing and annotating the prosody of speech. 中文主要关注break。
韵律等级结构:
音素 -> 音节 -> 韵律词(Prosody Word, PW) -> 韵律短语(prosody phrase, PPH) -> 语调短句(intonational phrase, IPH) -> 子句子 -> 主句子 -> 段落 -> 篇章
```
音素 -> 音节 -> 韵律词(Prosody Word, PW) -> 韵律短语(prosody phrase, PPH) -> 语调短句(intonational phrase, IPH) -> 子句子 -> 主句子 -> 段落 -> 篇章
LP -> LO -> L1(#1) -> L2(#2) -> L3(#3) -> L4(#4) -> L5 -> L6 -> L7
```
主要关注 PW, PPH, IPH
| | 停顿时长 | 前后音高特征 |
| --- | ----------| --- |
| 韵律词边界 | 不停顿或从听感上察觉不到停顿 | 无 |
| 韵律词边界 | 不停顿或从听感上察觉不到停顿 | 无 |
| 韵律短语边界 | 可以感知停顿,但无明显的静音段 | 音高不下倾或稍下倾,韵末不可做句末 |
| 语调短语边界 | 有较长停顿 | 音高下倾比较完全,韵末可以作为句末 |
常用方法使用的是级联CRF首先预测如果是PW再继续预测是否是PPH再预测是否是IPH
<img src="../images/prosody.jpeg" width=450><br/>
常用方法使用的是级联CRF首先预测如果是PW再继续预测是否是PPH再预测是否是IPH
<img src="../images/prosody.jpeg" width=450>
论文: 2015 .Ding Et al. - Automatic Prosody Prediction For Chinese Speech Synthesis Using BLSTM-RNN and Embedding Features
@ -116,7 +174,7 @@ LP -> LO -> L1(#1) -> L2(#2) -> L3(#3) -> L4(#4) -> L5 -> L6 -> L7
## 基于神经网络的前端文本分析模型
## 基于神经网络的前端文本分析模型
最近这两年基本都是基于 BERT所以这里记录一下相关的论文:
@ -128,6 +186,8 @@ LP -> LO -> L1(#1) -> L2(#2) -> L3(#3) -> L4(#4) -> L5 -> L6 -> L7
## 总结
总结一下,文本分析各个模块的方法:
@ -148,3 +208,5 @@ TN: 基于规则的方法
## Reference
* [Text Front End](https://slyne.github.io/%E5%85%AC%E5%BC%80%E8%AF%BE/2020/10/03/TTS1/)
* [Chinese Natural Language (Pre)processing: An Introduction](https://towardsdatascience.com/chinese-natural-language-pre-processing-an-introduction-995d16c2705f)
* [Beginners Guide to Sentiment Analysis for Simplified Chinese using SnowNLP](https://towardsdatascience.com/beginners-guide-to-sentiment-analysis-for-simplified-chinese-using-snownlp-ce88a8407efb)

@ -0,0 +1,31 @@
# VAD
## Endpoint Detection
### Kaldi
**Kaldi**使用规则方式,制定了五条规则,只要满足其中一条则认为是检测到了 endpoint。
1. 识别出文字之前,检测到了 5s 的静音;
2. 识别出文字之后,检测到了 2s 的静音;
3. 解码到概率较小的 final state且检测到了 1s 的静音;
4. 解码到概率较大的 final state且检测到了 0.5s 的静音;
5. 已经解码了 20s。
### CTC
将连续的长 blank 标签,视为非语音区域。非语音区域满足一定的条件,即可认为是检测到了 endpoint。同时参考 Kaldi 的 `src/online2/online-endpoint.h`,制定了以下三条规则:
1. 识别出文字之前,检测到了 5s 的静音;
2. 识别出文字之后,检测到了 1s 的静音;
3. 已经解码了 20s。
只要满足上述三条规则中的任意一条, 就认为检测到了 endpoint。
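A hedged sketch of these three rules in Python is shown below. The thresholds are the ones quoted above; the real check lives inside the decoder, which counts consecutive blank frames there:

```python
def is_endpoint(num_decoded_frames: int,
                trailing_blank_frames: int,
                has_emitted_text: bool,
                frame_shift_ms: int = 10) -> bool:
    """Return True if any of the three CTC endpoint rules fires."""
    trailing_silence_ms = trailing_blank_frames * frame_shift_ms
    decoded_ms = num_decoded_frames * frame_shift_ms
    rule1 = (not has_emitted_text) and trailing_silence_ms >= 5000  # 5 s silence, nothing recognized
    rule2 = has_emitted_text and trailing_silence_ms >= 1000        # 1 s silence after some text
    rule3 = decoded_ms >= 20000                                     # 20 s decoded in total
    return rule1 or rule2 or rule3

# e.g. 1.2 s of trailing blanks after some text has been recognized -> endpoint
assert is_endpoint(num_decoded_frames=800, trailing_blank_frames=120, has_emitted_text=True)
```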
## Reference
* [Endpoint 检测](https://mp.weixin.qq.com/s?__biz=MzU2NjUwMTgxOQ==&mid=2247484024&idx=1&sn=12da2ee76347de4a18856274ba6ba61f&chksm=fcaacaaccbdd43ba6b3e996bbf1e2ac6d5f1b449dfd80fcaccfbbe0a240fa1668b931dbf4bd5&scene=21#wechat_redirect)
* Kaldi: *https://github.com/kaldi-asr/kaldi/blob/6260b27d146e466c7e1e5c60858e8da9fd9c78ae/src/online2/online-endpoint.h#L132-L150*
* End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection: *https://arxiv.org/pdf/2002.00551.pdf*

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -1,7 +1,8 @@
# Aishell-1
## CTC
| Model | Config | Test set | CER |
| --- | --- | --- | --- |
| DeepSpeech2 | conf/deepspeech2.yaml | test | 0.078977 |
| DeepSpeech2 | release 1.8.5 | test | 0.080447 |
## DeepSpeech2
| Model | release | Config | Test set | CER |
| --- | --- | --- | --- | --- |
| DeepSpeech2 | 2.1 | conf/deepspeech2.yaml | test | 0.078671 |
| DeepSpeech2 | 2.0 | conf/deepspeech2.yaml | test | 0.078977 |
| DeepSpeech2 | 1.8.5 | - | test | 0.080447 |

@ -1,4 +1,13 @@
[
{
"type": "speed",
"params": {
"min_speed_rate": 0.9,
"max_speed_rate": 1.1,
"num_rates": 3
},
"prob": 0.0
},
{
"type": "shift",
"params": {

@ -10,9 +10,9 @@ data:
min_input_len: 0.0
max_input_len: 27.0 # second
min_output_len: 0.0
max_output_len: 400.0
min_output_input_ratio: 0.05
max_output_input_ratio: 10.0
max_output_len: .inf
min_output_input_ratio: 0.00
max_output_input_ratio: .inf
specgram_type: linear
target_sample_rate: 16000
max_freq: None
@ -41,7 +41,7 @@ training:
lr: 2e-3
lr_decay: 0.83
weight_decay: 1e-06
global_grad_clip: 5.0
global_grad_clip: 3.0
log_interval: 100
decoding:

@ -1,23 +0,0 @@
#! /usr/bin/env bash
if [ $# != 2 ];then
echo "usage: ${0} ckpt_dir avg_num"
exit -1
fi
ckpt_dir=${1}
average_num=${2}
decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams
python3 -u ${MAIN_ROOT}/utils/avg_model.py \
--dst_model ${decode_checkpoint} \
--ckpt_dir ${ckpt_dir} \
--num ${average_num} \
--val_best
if [ $? -ne 0 ]; then
echo "Failed in avg ckpt!"
exit 1
fi
exit 0

@ -32,7 +32,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--unit_type="char" \
--count_threshold=0 \
--vocab_path="data/vocab.txt" \
--manifest_paths "data/manifest.train.raw"
--manifest_paths "data/manifest.train.raw" "data/manifest.dev.raw"
if [ $? -ne 0 ]; then
echo "Build vocabulary failed. Terminated."
@ -51,8 +51,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--stride_ms=10.0 \
--window_ms=20.0 \
--sample_rate=16000 \
--use_dB_normalization=False \
--num_samples=-1 \
--use_dB_normalization=True \
--num_samples=2000 \
--num_workers=${num_workers} \
--output_path="data/mean_std.json"

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}/../../../
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -2,13 +2,14 @@
## Conformer
| Model | Config | Augmentation| Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- |
| --- | --- | --- | --- | --- | --- | --- |
| conformer | conf/conformer.yaml | spec_aug + shift | test | attention | - | 0.059858 |
| conformer | conf/conformer.yaml | spec_aug + shift | test | ctc_greedy_search | - | 0.062311 |
| conformer | conf/conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | - | 0.062196 |
| conformer | conf/conformer.yaml | spec_aug + shift | test | attention_rescoring | - | 0.054694 |
## Transformer
| Model | Config | Augmentation| Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- |
| --- | --- | --- | --- | --- | --- | ---|
| transformer | conf/transformer.yaml | spec_aug + shift | test | attention | - | - |

@ -24,7 +24,7 @@ data:
n_fft: None
stride_ms: 10.0
window_ms: 25.0
use_dB_normalization: False
use_dB_normalization: True
target_dB: -20
random_seed: 0
keep_transcription_text: False
@ -76,7 +76,7 @@ model:
training:
n_epoch: 240
accum_grad: 2
global_grad_clip: 5.0
global_grad_clip: 3.0
optim: adam
optim_conf:
lr: 0.002

@ -1,23 +0,0 @@
#! /usr/bin/env bash
if [ $# != 2 ]; then
echo "usage: ${0} ckpt_dir avg_num"
exit -1
fi
ckpt_dir=${1}
average_num=${2}
decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams
python3 -u ${MAIN_ROOT}/utils/avg_model.py \
--dst_model ${decode_checkpoint} \
--ckpt_dir ${ckpt_dir} \
--num ${average_num} \
--val_best
if [ $? -ne 0 ]; then
echo "Failed in avg ckpt!"
exit 1
fi
exit 0

@ -15,21 +15,26 @@ fi
config_path=$1
ckpt_prefix=$2
ckpt_name=$(basename ${ckpt_prefix})
mkdir -p exp
# download language model
#bash local/download_lm_ch.sh
#if [ $? -ne 0 ]; then
# exit 1
#fi
for type in attention ctc_greedy_search; do
echo "decoding ${type}"
batch_size=64
output_dir=${ckpt_prefix}
mkdir -p ${output_dir}
python3 -u ${BIN_DIR}/test.py \
--device ${device} \
--nproc 1 \
--config ${config_path} \
--result_file ${ckpt_prefix}.${type}.rsl \
--result_file ${output_dir}/${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decoding.decoding_method ${type} decoding.batch_size ${batch_size}
@ -42,11 +47,13 @@ done
for type in ctc_prefix_beam_search attention_rescoring; do
echo "decoding ${type}"
batch_size=1
output_dir=${ckpt_prefix}
mkdir -p ${output_dir}
python3 -u ${BIN_DIR}/test.py \
--device ${device} \
--nproc 1 \
--config ${config_path} \
--result_file ${ckpt_prefix}.${type}.rsl \
--result_file ${output_dir}/${type}.rsl \
--checkpoint_path ${ckpt_prefix} \
--opts decoding.decoding_method ${type} decoding.batch_size ${batch_size}

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}/../../../
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -0,0 +1,2 @@
data
exp

@ -0,0 +1,85 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# https://github.com/rubber-duck-dragon/rubber-duck-dragon.github.io/blob/master/cc-cedict_parser/parser.py
#A parser for the CC-Cedict. Convert the Chinese-English dictionary into a list of python dictionaries with "traditional","simplified", "pinyin", and "english" keys.
#Make sure that the cedict_ts.u8 file is in the same folder as this file, and that the name matches the file name on line 13.
#Before starting, open the CEDICT text file and delete the copyright information at the top. Otherwise the program will try to parse it and you will get an error message.
#Characters that are commonly used as surnames have two entries in CC-CEDICT. This program will remove the surname entry if there is another entry for the character. If you want to include the surnames, simply delete lines 59 and 60.
#This code was written by Franki Allegra in February 2020.
import json
import sys
# usage: bin ccedict dump.json
with open(sys.argv[1], 'rt') as file:
text = file.read()
lines = text.split('\n')
dict_lines = list(lines)
def parse_line(line):
parsed = {}
if line == '':
dict_lines.remove(line)
return 0
if line.startswith('#'):
return 0
if line.startswith('%'):
return 0
line = line.rstrip('/')
line = line.split('/')
if len(line) <= 1:
return 0
english = line[1]
char_and_pinyin = line[0].split('[')
characters = char_and_pinyin[0]
characters = characters.split()
traditional = characters[0]
simplified = characters[1]
pinyin = char_and_pinyin[1]
pinyin = pinyin.rstrip()
pinyin = pinyin.rstrip("]")
parsed['traditional'] = traditional
parsed['simplified'] = simplified
parsed['pinyin'] = pinyin
parsed['english'] = english
list_of_dicts.append(parsed)
def remove_surnames():
for x in range(len(list_of_dicts) - 1, -1, -1):
if "surname " in list_of_dicts[x]['english']:
if list_of_dicts[x]['traditional'] == list_of_dicts[x + 1][
'traditional']:
list_of_dicts.pop(x)
def main():
#make each line into a dictionary
print("Parsing dictionary . . .")
for line in dict_lines:
parse_line(line)
#remove entries for surnames from the data (optional):
print("Removing Surnames . . .")
remove_surnames()
print("Saving to database (this may take a few minutes) . . .")
with open(sys.argv[2], 'wt') as fout:
for one_dict in list_of_dicts:
json_str = json.dumps(one_dict)
fout.write(json_str + "\n")
print('Done!')
list_of_dicts = []
parsed_dict = main()

@ -0,0 +1,10 @@
export MAIN_ROOT=${PWD}/../../
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
export LD_LIBRARY_PATH=/usr/local/lib/:${LD_LIBRARY_PATH}

@ -0,0 +1,39 @@
#!/bin/bash
# CC-CEDICT download: https://www.mdbg.net/chinese/dictionary?page=cc-cedict
# The word dictionary of this website is based on CC-CEDICT.
# CC-CEDICT is a continuation of the CEDICT project started by Paul Denisowski in 1997 with the
# aim to provide a complete downloadable Chinese to English dictionary with pronunciation in pinyin for the Chinese characters.
# This website allows you to easily add new entries or correct existing entries in CC-CEDICT.
# Submitted entries will be checked and processed frequently and released for download in CEDICT format on this page.
set -e
source path.sh
stage=-1
stop_stage=100
source ${MAIN_ROOT}/utils/parse_options.sh || exit -1
cedict_url=https://www.mdbg.net/chinese/export/cedict/cedict_1_0_ts_utf-8_mdbg.zip
cedict=cedict_1_0_ts_utf-8_mdbg.zip
mkdir -p data
if [ $stage -le -1 ] && [ $stop_stage -ge -1 ];then
test -f data/${cedict} || wget -O data/${cedict} ${cedict_url}
pushd data
unzip ${cedict}
popd
fi
mkdir -p exp
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
cp data/cedict_ts.u8 exp/cedict
python3 local/parser.py exp/cedict exp/cedict.json
fi

@ -0,0 +1,2 @@
data
exp

@ -0,0 +1,5 @@
# Download Baker dataset
The Baker dataset has to be downloaded manually and moved to 'data/', because you have to pass a CAPTCHA in a browser to download it.
Download URL: https://test.data-baker.com/#/data/index/source.

@ -0,0 +1,53 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import re
import jieba
from pypinyin import lazy_pinyin
from pypinyin import Style
def extract_pinyin(source, target, use_jieba=False):
with open(source, 'rt', encoding='utf-8') as fin:
with open(target, 'wt', encoding='utf-8') as fout:
for i, line in enumerate(fin):
if i % 2 == 0:
sentence_id, raw_text = line.strip().split()
raw_text = re.sub(r'#\d', '', raw_text)
if use_jieba:
raw_text = jieba.lcut(raw_text)
syllables = lazy_pinyin(
raw_text,
errors='ignore',
style=Style.TONE3,
neutral_tone_with_five=True)
transcription = ' '.join(syllables)
fout.write(f'{sentence_id} {transcription}\n')
else:
continue
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="extract baker pinyin labels")
parser.add_argument(
"input", type=str, help="source file of baker's prosody label file")
parser.add_argument(
"output", type=str, help="target file to write pinyin lables")
parser.add_argument(
"--use-jieba",
action='store_true',
help="use jieba for word segmentation.")
args = parser.parse_args()
extract_pinyin(args.input, args.output, use_jieba=args.use_jieba)

@ -0,0 +1,37 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
def extract_pinyin_labels(source, target):
"""Extract pinyin labels from Baker's prosody labeling."""
with open(source, 'rt', encoding='utf-8') as fin:
with open(target, 'wt', encoding='utf-8') as fout:
for i, line in enumerate(fin):
if i % 2 == 0:
sentence_id, raw_text = line.strip().split()
fout.write(f'{sentence_id} ')
else:
transcription = line.strip()
fout.write(f'{transcription}\n')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="extract baker pinyin labels")
parser.add_argument(
"input", type=str, help="source file of baker's prosody label file")
parser.add_argument(
"output", type=str, help="target file to write pinyin lables")
args = parser.parse_args()
extract_pinyin_labels(args.input, args.output)

@ -0,0 +1,100 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from typing import List, Union
from pathlib import Path
def erized(syllable: str) -> bool:
"""Whether the syllable contains erhua effect.
Example
--------
huar -> True
guanr -> True
er -> False
"""
# note: for pinyin, len(syllable) >=2 is always true
# if not: there is something wrong in the data
assert len(syllable) >= 2, f"invalid syllable {syllable}"
return syllable[:2] != "er" and syllable[-2] == 'r'
def ignore_sandhi(reference: List[str], generated: List[str]) -> List[str]:
"""
Given a sequence of syllables from human annotation (reference),
which makes sandhi explicit, and a sequence of syllables from some
simple g2p program (generated), which does not consider sandhi,
return the reference sequence with sandhi ignored.
Example
--------
['lao2', 'hu3'], ['lao3', 'hu3'] -> ['lao3', 'hu3']
"""
i = 0
j = 0
# sandhi ignored in the result while other errors are not included
result = []
while i < len(reference):
if erized(reference[i]):
result.append(reference[i])
i += 1
j += 2
elif reference[i][:-1] == generated[i][:-1] and reference[i][
-1] == '2' and generated[i][-1] == '3':
result.append(generated[i])
i += 1
j += 1
else:
result.append(reference[i])
i += 1
j += 1
assert j == len(
generated
), "length of transcriptions mismatch, There may be some characters that are ignored in the generated transcription."
return result
def convert_transcriptions(reference: Union[str, Path], generated: Union[str, Path], output: Union[str, Path]):
with open(reference, 'rt') as f_ref:
with open(generated, 'rt') as f_gen:
with open(output, 'wt') as f_out:
for i, (ref, gen) in enumerate(zip(f_ref, f_gen)):
sentence_id, ref_transcription = ref.strip().split(' ', 1)
_, gen_transcription = gen.strip().split(' ', 1)
try:
result = ignore_sandhi(ref_transcription.split(),
gen_transcription.split())
result = ' '.join(result)
except Exception:
print(
f"sentence_id: {sentence_id} There is some annotation error in the reference or generated transcription. Use the reference."
)
result = ref_transcription
f_out.write(f"{sentence_id} {result}\n")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="reference transcription but ignore sandhi.")
parser.add_argument(
"--reference",
type=str,
help="path to the reference transcription of baker dataset.")
parser.add_argument(
"--generated", type=str, help="path to the generated transcription.")
parser.add_argument("--output", type=str, help="path to save result.")
args = parser.parse_args()
convert_transcriptions(args.reference, args.generated, args.output)

@ -0,0 +1,33 @@
#!/bin/bash
exp_dir="exp"
data_dir="data"
source ${MAIN_ROOT}/utils/parse_options.sh || exit -1
archive=${data_dir}/"BZNSYP.rar"
if [ ! -f ${archive} ]; then
echo "Baker Dataset not found! Download it first to the data_dir."
exit -1
fi
MD5='c4350563bf7dc298f7dd364b2607be83'
md5_result=$(md5sum ${archive} | awk -F[' '] '{print $1}')
if [ ${md5_result} != ${MD5} ]; then
echo "MD5 mismatch! The Archive has been changed."
exit -1
fi
label_file='ProsodyLabeling/000001-010000.txt'
filename='000001-010000.txt'
unrar e ${archive} ${label_file}
cp ${filename} ${exp_dir}
rm -f ${filename}
if [ ! -f ${exp_dir}/${filename} ];then
echo "File extraction failed!"
exit 1
fi
exit 0

@ -0,0 +1,8 @@
export MAIN_ROOT=${PWD}/../../
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}

@ -0,0 +1,33 @@
#!/usr/bin/env bash
source path.sh
stage=-1
stop_stage=100
exp_dir=exp
data_dir=data
source ${MAIN_ROOT}/utils/parse_options.sh || exit -1
mkdir -p ${exp_dir}
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
echo "stage 0: Extracting Prosody Labeling"
bash local/prepare_dataset.sh --exp-dir ${exp_dir} --data-dir ${data_dir}
fi
# convert transcription in chinese into pinyin with pypinyin or jieba+pypinyin
filename="000001-010000.txt"
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
echo "stage 1: Processing transcriptions..."
python3 local/extract_pinyin_label.py ${exp_dir}/${filename} ${exp_dir}/ref.pinyin
python3 local/convert_transcription.py ${exp_dir}/${filename} ${exp_dir}/trans.pinyin
python3 local/convert_transcription.py --use-jieba ${exp_dir}/${filename} ${exp_dir}/trans.jieba.pinyin
fi
echo "done"
exit 0

@ -1,6 +1,7 @@
# LibriSpeech
## CTC
## DeepSpeech2
| Model | Config | Test set | WER |
| --- | --- | --- | --- |
| DeepSpeech2 | conf/deepspeech2.yaml | test-clean | 0.073973 |

@ -1,4 +1,13 @@
[
{
"type": "speed",
"params": {
"min_speed_rate": 0.9,
"max_speed_rate": 1.1,
"num_rates": 3
},
"prob": 0.0
},
{
"type": "shift",
"params": {

@ -10,9 +10,9 @@ data:
min_input_len: 0.0
max_input_len: 27.0 # second
min_output_len: 0.0
max_output_len: 400.0
min_output_input_ratio: 0.05
max_output_input_ratio: 10.0
max_output_len: .inf
min_output_input_ratio: 0.00
max_output_input_ratio: .inf
specgram_type: linear
target_sample_rate: 16000
max_freq: None
@ -41,7 +41,7 @@ training:
lr: 1e-3
lr_decay: 0.83
weight_decay: 1e-06
global_grad_clip: 5.0
global_grad_clip: 3.0
log_interval: 100
decoding:

@ -1,23 +0,0 @@
#! /usr/bin/env bash
if [ $# != 2 ];then
echo "usage: ${0} ckpt_dir avg_num"
exit -1
fi
ckpt_dir=${1}
average_num=${2}
decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams
python3 -u ${MAIN_ROOT}/utils/avg_model.py \
--dst_model ${decode_checkpoint} \
--ckpt_dir ${ckpt_dir} \
--num ${average_num} \
--val_best
if [ $? -ne 0 ]; then
echo "Failed in avg ckpt!"
exit 1
fi
exit 0

@ -61,13 +61,13 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
num_workers=$(nproc)
python3 ${MAIN_ROOT}/utils/compute_mean_std.py \
--manifest_path="data/manifest.train.raw" \
--num_samples=-1 \
--num_samples=2000 \
--specgram_type="linear" \
--delta_delta=false \
--sample_rate=16000 \
--stride_ms=10.0 \
--window_ms=20.0 \
--use_dB_normalization=False \
--use_dB_normalization=True \
--num_workers=${num_workers} \
--output_path="data/mean_std.json"

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}/../../../
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -4,7 +4,7 @@ source path.sh
stage=0
stop_stage=100
conf_path=conf/transformer.yaml
conf_path=conf/deepspeech2.yaml
avg_num=30
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
@ -19,7 +19,7 @@ fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=4,5,6,7 ./local/train.sh ${conf_path} ${ckpt}
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./local/train.sh ${conf_path} ${ckpt}
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then

@ -2,7 +2,7 @@
## Conformer
| Model | Config | Augmentation| Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- |
| --- | --- | --- | --- | --- | --- | --- |
| conformer | conf/conformer.yaml | spec_aug + shift | test-all | attention | test-all 6.35 | 0.057117 |
| conformer | conf/conformer.yaml | spec_aug + shift | test-clean | attention | test-all 6.35 | 0.030162 |
| conformer | conf/conformer.yaml | spec_aug + shift | test-clean | ctc_greedy_search | test-all 6.35 | 0.037910 |
@ -10,7 +10,8 @@
| conformer | conf/conformer.yaml | spec_aug + shift | test-clean | attention_rescoring | test-all 6.35 | 0.032115 |
## Transformer
| Model | Config | Augmentation| Test set | Decode method | Loss | WER |
| --- | --- | --- | --- | --- | --- |
| --- | --- | --- | --- | --- | --- | --- |
| transformer | conf/transformer.yaml | spec_aug + shift | test-all | attention | test-all 6.98 | 0.066500 |
| transformer | conf/transformer.yaml | spec_aug + shift | test-clean | attention | test-all 6.98 | 0.036 |

@ -77,7 +77,7 @@ model:
training:
n_epoch: 120
accum_grad: 8
global_grad_clip: 5.0
global_grad_clip: 3.0
optim: adam
optim_conf:
lr: 0.004

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}/../../../
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -0,0 +1,7 @@
# Ngram LM
Train a Chinese character n-gram LM with [kenlm](https://github.com/kpu/kenlm).
```
bash run.sh
```

@ -1,4 +1,6 @@
# SPM demo
# [SentencePiece Model](https://github.com/google/sentencepiece)
Train a `spm` model to use as an English tokenizer.
```
bash run.sh

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}/../../
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -1,23 +0,0 @@
#! /usr/bin/env bash
if [ $# != 2 ];then
echo "usage: ${0} ckpt_dir avg_num"
exit -1
fi
ckpt_dir=${1}
average_num=${2}
decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams
python3 -u ${MAIN_ROOT}/utils/avg_model.py \
--dst_model ${decode_checkpoint} \
--ckpt_dir ${ckpt_dir} \
--num ${average_num} \
--val_best
if [ $? -ne 0 ]; then
echo "Failed in avg ckpt!"
exit 1
fi
exit 0

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}/../../../
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -1,23 +0,0 @@
#! /usr/bin/env bash
if [ $# != 2 ];then
echo "usage: ${0} ckpt_dir avg_num"
exit -1
fi
ckpt_dir=${1}
average_num=${2}
decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams
python3 -u ${MAIN_ROOT}/utils/avg_model.py \
--dst_model ${decode_checkpoint} \
--ckpt_dir ${ckpt_dir} \
--num ${average_num} \
--val_best
if [ $? -ne 0 ]; then
echo "Failed in avg ckpt!"
exit 1
fi
exit 0

@ -1,6 +1,6 @@
export MAIN_ROOT=${PWD}/../../../
export PATH=${MAIN_ROOT}:${PWD}/tools:${PATH}
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C

@ -1,5 +1,6 @@
coverage
pre-commit
pybind11
resampy==0.2.2
scipy==1.2.1
sentencepiece
@ -7,6 +8,7 @@ snakeviz
SoundFile==0.9.0.post1
sox
tensorboardX
textgrid
typeguard
yacs
pybind11

@ -18,7 +18,6 @@ import paddle
from deepspeech.modules.mask import make_non_pad_mask
from deepspeech.modules.mask import make_pad_mask
from deepspeech.modules.mask import sequence_mask
class TestU2Model(unittest.TestCase):
@ -26,25 +25,21 @@ class TestU2Model(unittest.TestCase):
paddle.set_device('cpu')
self.lengths = paddle.to_tensor([5, 3, 2])
self.masks = np.array([
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0],
[True, True, True, True, True],
[True, True, True, False, False],
[True, True, False, False, False],
])
self.pad_masks = np.array([
[0, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 0, 1, 1, 1],
[False, False, False, False, False],
[False, False, False, True, True],
[False, False, True, True, True],
])
def test_sequence_mask(self):
res = sequence_mask(self.lengths)
self.assertSequenceEqual(res.numpy().tolist(), self.masks.tolist())
def test_make_non_pad_mask(self):
res = make_non_pad_mask(self.lengths)
res1 = sequence_mask(self.lengths)
res2 = make_pad_mask(self.lengths).logical_not()
self.assertSequenceEqual(res.numpy().tolist(), self.masks.tolist())
self.assertSequenceEqual(res.numpy().tolist(), res1.numpy().tolist())
self.assertSequenceEqual(res.numpy().tolist(), res2.numpy().tolist())
def test_make_pad_mask(self):
res = make_pad_mask(self.lengths)

@ -18,3 +18,7 @@ licence: MIT
* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
licence: MIT
* [phkit](https://github.com/KuangDD/phkit.git)
commit: b2100293c1e36da531d7f30bd52c9b955a649522
licence: None

@ -0,0 +1,155 @@
![phkit](phkit.png "phkit")
## phkit
phoneme toolkit: 拼音相关的文本处理工具箱,中文和英文的语音合成前端文本解决方案。
#### 安装
```
pip install -U phkit
```
#### 版本
v0.2.8
### pinyinkit
文本转拼音的模块依赖python-pinyinjiebaphrase-pinyin-data模块。
### chinese
适用于中文、英文和中英混合的音素,其中汉字拼音采用清华大学的音素,英文字符分字母和英文。
- 中文音素简介:
```
声母:
aa b c ch d ee f g h ii j k l m n oo p q r s sh t uu vv x z zh
韵母:
a ai an ang ao e ei en eng er i ia ian iang iao ie in ing iong iu ix iy iz o ong ou u ua uai uan uang ueng ui un uo v van ve vn ng uong
声调:
1 2 3 4 5
字母:
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz
英文:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
标点:
! ? . , ; : " # ( )
注:!=!|?=?|.=.。|,=,,、|;=;|:=:|"="“|#=#   |(=([{{【<《|)=)]}}】>》
预留:
w y 0 6 7 8 9
w=%|y=$|0=0|6=6|7=7|8=8|9=9
其他:
_ ~ - *
```
#### symbol
音素标记。
中文音素,简单英文音素,简单中文音素。
#### sequence
转为序列的方法文本转为音素列表文本转为ID列表。
拼音变调,拼音转音素。
#### pinyin
转为拼音的方法,汉字转拼音,分离声调。
拼音为字母+数字形式例如pin1。
#### phoneme
音素映射表。
不带声调拼音转为音素,声调转音素,英文字母转音素,标点转音素。
#### number
数字读法。
按数值大小读,一个一个数字读。
#### convert
文本转换。
全角半角转换,简体繁体转换。
#### style
拼音格式转换。
国标样式的拼音和字母数字的样式的拼音相互转换。
### english
from https://github.com/keithito/tacotron "
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
### 历史版本
#### v0.2.8
- 文本转拼音轻声用5表示音调。
- 文本转拼音确保文本和拼音一一对应,文本长度和拼音列表长度相同。
- 增加拼音格式转换,国标格式和字母数字格式相互转换。
#### v0.2.7
- 所有中文音素都能被映射到。
#### v0.2.5
- 修正拼音转音素的潜在bug。
#### v0.2.4
- 修正几个默认拼音。
#### v0.2.3
- 汉字转拼音轻量化。
- 词语拼音词典去除全都是默认拼音的词语。
#### v0.2.2
- 修正安装依赖报错问题。
#### v0.2.1
- 增加中文的text_to_sequence方法可替换英文版本应对中文环境。
- 兼容v0.1.0之前版本需要在python3.7版本以上否则请改为从phkit.chinese导入模块。
#### v0.2.0
- 增加文本转拼音的模块依赖python-pinyinjiebaphrase-pinyin-data模块。
- 中文的音素方案移动到chinese模块。
#### v0.1.0
- 增加英文版本的音素方案,包括英文字母和英文音素。
- 增加简单的数字转中文的方法。
#### todo
```
文本正则化处理
数字读法
字符读法
常见规则读法
文本转拼音
pypinyin
国标和alnum转换
anything转音素
字符
英文
汉字
OOV
进阶:
分词
命名实体识别
依存句法分析
```

@ -0,0 +1,115 @@
#!usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/17
"""
![phkit](phkit.png "phkit")
## phkit
phoneme toolkit: 拼音相关的文本处理工具箱中文和英文的语音合成前端文本解决方案
#### 安装
```
pip install -U phkit
```
"""
__version__ = "0.2.8"
version_doc = """
#### 版本
v{}
""".format(__version__)
history_doc = """
### 历史版本
#### v0.2.8
- 文本转拼音轻声用5表示音调
- 文本转拼音确保文本和拼音一一对应文本长度和拼音列表长度相同
- 增加拼音格式转换国标格式和字母数字格式相互转换
#### v0.2.7
- 所有中文音素都能被映射到
#### v0.2.5
- 修正拼音转音素的潜在bug
#### v0.2.4
- 修正几个默认拼音
#### v0.2.3
- 汉字转拼音轻量化
- 词语拼音词典去除全都是默认拼音的词语
#### v0.2.2
- 修正安装依赖报错问题
#### v0.2.1
- 增加中文的text_to_sequence方法可替换英文版本应对中文环境
- 兼容v0.1.0之前版本需要在python3.7版本以上否则请改为从phkit.chinese导入模块
#### v0.2.0
- 增加文本转拼音的模块依赖python-pinyinjiebaphrase-pinyin-data模块
- 中文的音素方案移动到chinese模块
#### v0.1.0
- 增加英文版本的音素方案包括英文字母和英文音素
- 增加简单的数字转中文的方法
#### todo
```
文本正则化处理
数字读法
字符读法
常见规则读法
文本转拼音
pypinyin
国标和alnum转换
anything转音素
字符
英文
汉字
OOV
进阶:
分词
命名实体识别
依存句法分析
```
"""
from phkit.chinese import __doc__ as doc_chinese
from phkit.chinese.symbol import __doc__ as doc_symbol
from phkit.chinese.sequence import __doc__ as doc_sequence
from phkit.chinese.pinyin import __doc__ as doc_pinyin
from phkit.chinese.phoneme import __doc__ as doc_phoneme
from phkit.chinese.number import __doc__ as doc_number
from phkit.chinese.convert import __doc__ as doc_convert
from phkit.chinese.style import __doc__ as doc_style
from .english import __doc__ as doc_english
from .pinyinkit import __doc__ as doc_pinyinkit
readme_docs = [__doc__, version_doc,
doc_pinyinkit,
doc_chinese, doc_symbol, doc_sequence, doc_pinyin, doc_phoneme, doc_number, doc_convert, doc_style,
doc_english,
history_doc]
from .chinese import text_to_sequence as chinese_text_to_sequence, sequence_to_text as chinese_sequence_to_text
from .english import text_to_sequence as english_text_to_sequence, sequence_to_text as english_sequence_to_text
from .pinyinkit import lazy_pinyin
# 兼容0.1.0之前的版本python3.7以上版本支持。
from .chinese import convert, number, phoneme, sequence, symbol, style
from .chinese.style import guobiao2shengyundiao, shengyundiao2guobiao
from .chinese.convert import fan2jian, jian2fan, quan2ban, ban2quan
from .chinese.number import say_digit, say_decimal, say_number
from .chinese.pinyin import text2pinyin, split_pinyin
from .chinese.sequence import text2sequence, text2phoneme, pinyin2phoneme, phoneme2sequence, sequence2phoneme
from .chinese.sequence import symbol_chinese, ph2id_dict, id2ph_dict
if __name__ == "__main__":
print(__file__)

@ -0,0 +1,79 @@
"""
### chinese
Phonemes for Chinese, English and mixed Chinese-English text. Hanzi pinyin follows the Tsinghua University phoneme scheme; English characters are split into letter names and English letters.
- Overview of the Chinese phoneme set:
```
Initials:
aa b c ch d ee f g h ii j k l m n oo p q r s sh t uu vv x z zh
Finals:
a ai an ang ao e ei en eng er i ia ian iang iao ie in ing iong iu ix iy iz o ong ou u ua uai uan uang ueng ui un uo v van ve vn ng uong
Tones:
1 2 3 4 5
Letter names:
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz
English letters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Punctuation:
! ? . , ; : " # ( )
!=!！|?=?？|.=.。|,=,，、|;=;；|:=:：|"="“”'‘’|#=#＃ 　\t|(=(（[［{｛【<《|)=)）]］}｝】>》
Reserved:
w y 0 6 7 8 9
w=%|y=$|0=0|6=6|7=7|8=8|9=9
Others:
_ ~ - *
```
```
"""
from .convert import fan2jian, jian2fan, quan2ban, ban2quan
from .number import say_digit, say_decimal, say_number
from .pinyin import text2pinyin, split_pinyin
from .sequence import text2sequence, text2phoneme, pinyin2phoneme, phoneme2sequence, sequence2phoneme, change_diao
from .sequence import symbol_chinese, ph2id_dict, id2ph_dict
from .symbol import symbol_chinese as symbols
from .phoneme import shengyun2ph_dict
def text_to_sequence(src, cleaner_names=None, **kwargs):
"""
    Text example: 卡尔普陪外孙玩滑梯。
    Pinyin example: ka3 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1 .
    :param src: str, a pinyin string or a plain-text string.
    :param cleaner_names: selects the processing method; currently "pinyin" or plain text.
    :return: list, a list of phoneme IDs.
"""
if cleaner_names == "pinyin":
pys = []
for py in src.split():
if py.isalnum():
pys.append(py)
else:
pys.append((py,))
phs = pinyin2phoneme(pys)
phs = change_diao(phs)
seq = phoneme2sequence(phs)
return seq
else:
return text2sequence(src)
def sequence_to_text(src):
out = sequence2phoneme(src)
return " ".join(out)
if __name__ == "__main__":
print(__file__)
text = "ka3 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1 . "
out = text_to_sequence(text)
print(out)
out = sequence_to_text(out)
print(out)
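    # Hedged sketch of the plain-text branch: without cleaner_names="pinyin" the input is
    # treated as raw text and routed through text2sequence (text -> pinyin -> phonemes -> IDs),
    # which needs the pinyin backends (python-pinyin, jieba, phrase-pinyin-data).
    seq = text_to_sequence("卡尔普陪外孙玩滑梯。")
    print(seq)
    print(sequence_to_text(seq))    # phonemes joined by spaces, roughly "k a 3 - ee er 3 - ..."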

@ -0,0 +1,51 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/17
"""
#### convert
Text conversion:
full-width/half-width conversion and simplified/traditional conversion.
"""
from .hanziconv import HanziConv
hc = HanziConv()
# Traditional -> Simplified
fan2jian = hc.toSimplified
# Simplified -> Traditional
jian2fan = hc.toTraditional
# Half-width -> full-width mapping table (ASCII 33-126 shifted by 65248; space maps to U+3000)
ban2quan_dict = {i: i + 65248 for i in range(33, 127)}
ban2quan_dict.update({32: 12288})
# Full-width -> half-width mapping table
quan2ban_dict = {v: k for k, v in ban2quan_dict.items()}
def ban2quan(text: str):
"""
    Convert half-width characters to full-width.
:param text:
:return:
"""
return text.translate(ban2quan_dict)
def quan2ban(text: str):
"""
    Convert full-width characters to half-width.
:param text:
:return:
"""
return text.translate(quan2ban_dict)
if __name__ == "__main__":
    assert ban2quan("aA1 ,:$。、") == "ａＡ１　，：＄。、"
    assert quan2ban("ａＡ１　，：＄。、") == "aA1 ,:$。、"
assert jian2fan("中国语言") == "中國語言"
assert fan2jian("中國語言") == "中国语言"
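    # Illustrative checks of the mapping rule itself: full-width forms sit exactly 65248
    # code points above their ASCII counterparts, the space maps to the ideographic space
    # U+3000, and quan2ban is the exact inverse for that range.
    assert ord(ban2quan("A")) - ord("A") == 65248
    assert ban2quan(" ") == "\u3000"
    assert quan2ban(ban2quan("Hello, 123!")) == "Hello, 123!"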

@ -0,0 +1,99 @@
# Copyright 2014 Bernard Yue
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
__doc__ = """
Hanzi Converter 繁簡轉換器 | 繁简转换器
This module provides functions for converting Chinese text between simplified and
traditional characters. It returns a Unicode representation of the text.
Class HanziConv is the main entry point of the module, you can import the
class by doing:
>>> from hanziconv import HanziConv
"""
import os
from zhon import cedict
class HanziConv():
"""This class supports hanzi (漢字) convention between simplified and
traditional format"""
__traditional_charmap = cedict.traditional
__simplified_charmap = cedict.simplified
@classmethod
def __convert(cls, text, toTraditional=True):
"""Convert `text` to Traditional characters if `toTraditional` is
True, else convert to simplified characters
:param text: data to convert
:param toTraditional: True -- convert to traditional text
                              False -- convert to simplified text
        :returns: converted `text`
"""
if isinstance(text, bytes):
text = text.decode('utf-8')
fromMap = cls.__simplified_charmap
toMap = cls.__traditional_charmap
if not toTraditional:
fromMap = cls.__traditional_charmap
toMap = cls.__simplified_charmap
final = []
for c in text:
index = fromMap.find(c)
if index != -1:
final.append(toMap[index])
else:
final.append(c)
return ''.join(final)
@classmethod
def toSimplified(cls, text):
"""Convert `text` to simplified character string. Assuming text is
traditional character string
:param text: text to convert
:returns: converted UTF-8 characters
>>> from hanziconv import HanziConv
>>> print(HanziConv.toSimplified('繁簡轉換器'))
繁简转换器
"""
return cls.__convert(text, toTraditional=False)
@classmethod
def toTraditional(cls, text):
"""Convert `text` to traditional character string. Assuming text is
simplified character string
:param text: text to convert
:returns: converted UTF-8 characters
>>> from hanziconv import HanziConv
>>> print(HanziConv.toTraditional('繁简转换器'))
繁簡轉換器
"""
return cls.__convert(text, toTraditional=True)
@classmethod
def same(cls, text1, text2):
"""Return True if text1 and text2 meant literally the same, False
otherwise
:param text1: string to compare to ``text2``
:param text2: string to compare to ``text1``
:returns: **True** -- ``text1`` and ``text2`` are the same in meaning,
**False** -- otherwise
>>> from hanziconv import HanziConv
>>> print(HanziConv.same('繁简转换器', '繁簡轉換器'))
True
"""
t1 = cls.toSimplified(text1)
t2 = cls.toSimplified(text2)
return t1 == t2
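if __name__ == "__main__":
    # Minimal smoke test (assumes the `zhon` package is installed). Note that the positional
    # lookup in __convert relies on zhon.cedict.simplified and zhon.cedict.traditional being
    # index-aligned; if a zhon release does not keep them aligned, conversions may be off.
    print(HanziConv.toTraditional('繁简转换器'))
    print(HanziConv.toSimplified('繁簡轉換器'))
    print(HanziConv.same('繁简转换器', '繁簡轉換器'))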

@ -0,0 +1,90 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### number
Number readings:
read a number by its value, or read it digit by digit.
"""
import re
_number_cn = ['零', '一', '二', '三', '四', '五', '六', '七', '八', '九']
_number_level = ['千', '百', '十', '万', '千', '百', '十', '亿', '千', '百', '十', '万', '千', '百', '十', '个']
_zero = _number_cn[0]
_ten_re = re.compile(r'^一十')
_grade_level = {'万', '亿', '个'}
_number_group_re = re.compile(r"([0-9]+)")
def say_digit(num: str) -> str:
"""123 -> 一二三
Args:
num (str): digit
Returns:
str: hanzi number
"""
outs = []
for zi in num:
outs.append(_number_cn[int(zi)])
return ''.join(outs)
def say_number(num: str):
x = str(int(num))
if x == '0':
return _number_cn[0]
elif len(x) > 16:
return num
length = len(x)
outs = []
for num, zi in enumerate(x):
a = _number_cn[int(zi)]
b = _number_level[len(_number_level) - length + num]
if a != _zero:
outs.append(a)
outs.append(b)
else:
if b in _grade_level:
if outs[-1] != _zero:
outs.append(b)
else:
outs[-1] = b
else:
if outs[-1] != _zero:
outs.append(a)
out = ''.join(outs[:-1])
    out = _ten_re.sub(r'十', out)
return out
def say_decimal(num: str):
z, x = num.split('.')
z_cn = say_number(z)
x_cn = say_digit(x)
    return z_cn + '点' + x_cn
def convert_number(text):
parts = _number_group_re.split(text)
outs = []
for elem in parts:
if elem.isdigit():
if len(elem) <= 9:
outs.append(say_number(elem))
else:
outs.append(say_digit(elem))
else:
outs.append(elem)
return ''.join(outs)
if __name__ == "__main__":
print(__file__)
assert say_number("1234567890123456") == "一千二百三十四万五千六百七十八亿九千零一十二万三千四百五十六"
assert say_digit("123456") == "一二三四五六"
assert say_decimal("3.14") == "三点一四"
assert convert_number("hello314.1592and2718281828") == "hello三百一十四.一千五百九十二and二七一八二八一八二八"
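    # A few extra illustrative cases: convert_number reads digit runs of up to 9 digits by
    # value and longer runs digit by digit, and a leading "一十" is collapsed to "十".
    assert say_digit("2021") == "二零二一"
    assert say_number("10") == "十"
    assert say_number("100") == "一百"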

@ -0,0 +1,480 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### phoneme
Phoneme mapping tables:
toneless pinyin to phonemes, tones to phonemes, English letters to phonemes, punctuation to phonemes.
"""
# Pinyin-to-phoneme mapping table (420 syllables)
shengyun2ph_dict = {
'a': 'aa a',
'ai': 'aa ai',
'an': 'aa an',
'ang': 'aa ang',
'ao': 'aa ao',
'ba': 'b a',
'bai': 'b ai',
'ban': 'b an',
'bang': 'b ang',
'bao': 'b ao',
'bei': 'b ei',
'ben': 'b en',
'beng': 'b eng',
'bi': 'b i',
'bian': 'b ian',
'biao': 'b iao',
'bie': 'b ie',
'bin': 'b in',
'bing': 'b ing',
'bo': 'b o',
'bu': 'b u',
'ca': 'c a',
'cai': 'c ai',
'can': 'c an',
'cang': 'c ang',
'cao': 'c ao',
'ce': 'c e',
'cen': 'c en',
'ceng': 'c eng',
'ci': 'c iy',
'cong': 'c ong',
'cou': 'c ou',
'cu': 'c u',
'cuan': 'c uan',
'cui': 'c ui',
'cun': 'c un',
'cuo': 'c uo',
'cha': 'ch a',
'chai': 'ch ai',
'chan': 'ch an',
'chang': 'ch ang',
'chao': 'ch ao',
'che': 'ch e',
'chen': 'ch en',
'cheng': 'ch eng',
'chi': 'ch ix',
'chong': 'ch ong',
'chou': 'ch ou',
'chu': 'ch u',
'chuai': 'ch uai',
'chuan': 'ch uan',
'chuang': 'ch uang',
'chui': 'ch ui',
'chun': 'ch un',
'chuo': 'ch uo',
'da': 'd a',
'dai': 'd ai',
'dan': 'd an',
'dang': 'd ang',
'dao': 'd ao',
'de': 'd e',
'dei': 'd ei',
'deng': 'd eng',
'di': 'd i',
'dia': 'd ia',
'dian': 'd ian',
'diao': 'd iao',
'die': 'd ie',
'ding': 'd ing',
'diu': 'd iu',
'dong': 'd ong',
'dou': 'd ou',
'du': 'd u',
'duan': 'd uan',
'dui': 'd ui',
'dun': 'd un',
'duo': 'd uo',
'e': 'ee e',
'ei': 'ee ei',
'en': 'ee en',
'er': 'ee er',
'fa': 'f a',
'fan': 'f an',
'fang': 'f ang',
'fei': 'f ei',
'fen': 'f en',
'feng': 'f eng',
'fo': 'f o',
'fou': 'f ou',
'fu': 'f u',
'ga': 'g a',
'gai': 'g ai',
'gan': 'g an',
'gang': 'g ang',
'gao': 'g ao',
'ge': 'g e',
'gei': 'g ei',
'gen': 'g en',
'geng': 'g eng',
'gong': 'g ong',
'gou': 'g ou',
'gu': 'g u',
'gua': 'g ua',
'guai': 'g uai',
'guan': 'g uan',
'guang': 'g uang',
'gui': 'g ui',
'gun': 'g un',
'guo': 'g uo',
'ha': 'h a',
'hai': 'h ai',
'han': 'h an',
'hang': 'h ang',
'hao': 'h ao',
'he': 'h e',
'hei': 'h ei',
'hen': 'h en',
'heng': 'h eng',
'hong': 'h ong',
'hou': 'h ou',
'hu': 'h u',
'hua': 'h ua',
'huai': 'h uai',
'huan': 'h uan',
'huang': 'h uang',
'hui': 'h ui',
'hun': 'h un',
'huo': 'h uo',
'yi': 'ii i',
'ya': 'ii ia',
'yan': 'ii ian',
'yang': 'ii iang',
'yao': 'ii iao',
'ye': 'ii ie',
'yin': 'ii in',
'ying': 'ii ing',
'yong': 'ii iong',
'you': 'ii iu',
'ji': 'j i',
'jia': 'j ia',
'jian': 'j ian',
'jiang': 'j iang',
'jiao': 'j iao',
'jie': 'j ie',
'jin': 'j in',
'jing': 'j ing',
'jiong': 'j iong',
'jiu': 'j iu',
'ju': 'j v',
'juan': 'j van',
'jue': 'j ve',
'jun': 'j vn',
'ka': 'k a',
'kai': 'k ai',
'kan': 'k an',
'kang': 'k ang',
'kao': 'k ao',
'ke': 'k e',
'ken': 'k en',
'keng': 'k eng',
'kong': 'k ong',
'kou': 'k ou',
'ku': 'k u',
'kua': 'k ua',
'kuai': 'k uai',
'kuan': 'k uan',
'kuang': 'k uang',
'kui': 'k ui',
'kun': 'k un',
'kuo': 'k uo',
'la': 'l a',
'lai': 'l ai',
'lan': 'l an',
'lang': 'l ang',
'lao': 'l ao',
'le': 'l e',
'lei': 'l ei',
'leng': 'l eng',
'li': 'l i',
'lia': 'l ia',
'lian': 'l ian',
'liang': 'l iang',
'liao': 'l iao',
'lie': 'l ie',
'lin': 'l in',
'ling': 'l ing',
'liu': 'l iu',
'lo': 'l o',
'long': 'l ong',
'lou': 'l ou',
'lu': 'l u',
'luan': 'l uan',
'lun': 'l un',
'luo': 'l uo',
'lv': 'l v',
'lve': 'l ve',
'ma': 'm a',
'mai': 'm ai',
'man': 'm an',
'mang': 'm ang',
'mao': 'm ao',
'me': 'm e',
'mei': 'm ei',
'men': 'm en',
'meng': 'm eng',
'mi': 'm i',
'mian': 'm ian',
'miao': 'm iao',
'mie': 'm ie',
'min': 'm in',
'ming': 'm ing',
'miu': 'm iu',
'mo': 'm o',
'mou': 'm ou',
'mu': 'm u',
'na': 'n a',
'nai': 'n ai',
'nan': 'n an',
'nang': 'n ang',
'nao': 'n ao',
'ne': 'n e',
'nei': 'n ei',
'nen': 'n en',
'neng': 'n eng',
'ni': 'n i',
'nian': 'n ian',
'niang': 'n iang',
'niao': 'n iao',
'nie': 'n ie',
'nin': 'n in',
'ning': 'n ing',
'niu': 'n iu',
'nong': 'n ong',
'nu': 'n u',
'nuan': 'n uan',
'nuo': 'n uo',
'nv': 'n v',
'nve': 'n ve',
'o': 'oo o',
'ou': 'oo ou',
'pa': 'p a',
'pai': 'p ai',
'pan': 'p an',
'pang': 'p ang',
'pao': 'p ao',
'pei': 'p ei',
'pen': 'p en',
'peng': 'p eng',
'pi': 'p i',
'pian': 'p ian',
'piao': 'p iao',
'pie': 'p ie',
'pin': 'p in',
'ping': 'p ing',
'po': 'p o',
'pou': 'p ou',
'pu': 'p u',
'qi': 'q i',
'qia': 'q ia',
'qian': 'q ian',
'qiang': 'q iang',
'qiao': 'q iao',
'qie': 'q ie',
'qin': 'q in',
'qing': 'q ing',
'qiong': 'q iong',
'qiu': 'q iu',
'qu': 'q v',
'quan': 'q van',
'que': 'q ve',
'qun': 'q vn',
'ran': 'r an',
'rang': 'r ang',
'rao': 'r ao',
're': 'r e',
'ren': 'r en',
'reng': 'r eng',
'ri': 'r iz',
'rong': 'r ong',
'rou': 'r ou',
'ru': 'r u',
'ruan': 'r uan',
'rui': 'r ui',
'run': 'r un',
'ruo': 'r uo',
'sa': 's a',
'sai': 's ai',
'san': 's an',
'sang': 's ang',
'sao': 's ao',
'se': 's e',
'sen': 's en',
'seng': 's eng',
'si': 's iy',
'song': 's ong',
'sou': 's ou',
'su': 's u',
'suan': 's uan',
'sui': 's ui',
'sun': 's un',
'suo': 's uo',
'sha': 'sh a',
'shai': 'sh ai',
'shan': 'sh an',
'shang': 'sh ang',
'shao': 'sh ao',
'she': 'sh e',
'shei': 'sh ei',
'shen': 'sh en',
'sheng': 'sh eng',
'shi': 'sh ix',
'shou': 'sh ou',
'shu': 'sh u',
'shua': 'sh ua',
'shuai': 'sh uai',
'shuan': 'sh uan',
'shuang': 'sh uang',
'shui': 'sh ui',
'shun': 'sh un',
'shuo': 'sh uo',
'ta': 't a',
'tai': 't ai',
'tan': 't an',
'tang': 't ang',
'tao': 't ao',
'te': 't e',
'teng': 't eng',
'ti': 't i',
'tian': 't ian',
'tiao': 't iao',
'tie': 't ie',
'ting': 't ing',
'tong': 't ong',
'tou': 't ou',
'tu': 't u',
'tuan': 't uan',
'tui': 't ui',
'tun': 't un',
'tuo': 't uo',
'wu': 'uu u',
'wa': 'uu ua',
'wai': 'uu uai',
'wan': 'uu uan',
'wang': 'uu uang',
'weng': 'uu ueng',
'wei': 'uu ui',
'wen': 'uu un',
'wo': 'uu uo',
'yu': 'vv v',
'yuan': 'vv van',
'yue': 'vv ve',
'yun': 'vv vn',
'xi': 'x i',
'xia': 'x ia',
'xian': 'x ian',
'xiang': 'x iang',
'xiao': 'x iao',
'xie': 'x ie',
'xin': 'x in',
'xing': 'x ing',
'xiong': 'x iong',
'xiu': 'x iu',
'xu': 'x v',
'xuan': 'x van',
'xue': 'x ve',
'xun': 'x vn',
'za': 'z a',
'zai': 'z ai',
'zan': 'z an',
'zang': 'z ang',
'zao': 'z ao',
'ze': 'z e',
'zei': 'z ei',
'zen': 'z en',
'zeng': 'z eng',
'zi': 'z iy',
'zong': 'z ong',
'zou': 'z ou',
'zu': 'z u',
'zuan': 'z uan',
'zui': 'z ui',
'zun': 'z un',
'zuo': 'z uo',
'zha': 'zh a',
'zhai': 'zh ai',
'zhan': 'zh an',
'zhang': 'zh ang',
'zhao': 'zh ao',
'zhe': 'zh e',
'zhei': 'zh ei',
'zhen': 'zh en',
'zheng': 'zh eng',
'zhi': 'zh ix',
'zhong': 'zh ong',
'zhou': 'zh ou',
'zhu': 'zh u',
'zhua': 'zh ua',
'zhuai': 'zh uai',
'zhuan': 'zh uan',
'zhuang': 'zh uang',
'zhui': 'zh ui',
'zhun': 'zh un',
'zhuo': 'zh uo',
'cei': 'c ei',
'chua': 'ch ua',
'den': 'd en',
'din': 'd in',
'eng': 'ee eng',
'ng': 'ee ng',
'fiao': 'f iao',
'yo': 'ii o',
'kei': 'k ei',
'len': 'l en',
'nia': 'n ia',
'nou': 'n ou',
'nun': 'n un',
'rua': 'r ua',
'tei': 't ei',
'wong': 'uu uong',
'n': 'n ng'
}
diao2ph_dict = {'1': '1', '2': '2', '3': '3', '4': '4', '5': '5'}
# Letter-name phonemes (26)
_alphabet = 'Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz'.split()
# Letters (26)
_upper = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
_lower = list('abcdefghijklmnopqrstuvwxyz')
upper2ph_dict = dict(zip(_upper, _alphabet))
lower2ph_dict = dict(zip(_lower, _upper))
# Punctuation (10)
_biaodian = '! ? . , ; : " # ( )'.split()
# Note: !=!！|?=?？|.=.。|,=,，、|;=;；|:=:：|"="“”'‘’|#=#＃ 　\t|(=(（[［{｛【<《|)=)）]］}｝】>》
biao2ph_dict = {
    '!': '!', '！': '!',
    '?': '?', '？': '?',
    '.': '.', '。': '.',
    ',': ',', '，': ',', '、': ',',
    ';': ';', '；': ';',
    ':': ':', '：': ':',
    '"': '"', '“': '"', '”': '"', "'": '"', '‘': '"', '’': '"',
    '#': '#', '＃': '#', ' ': '#', '　': '#', '\t': '#',
    '(': '(', '（': '(', '[': '(', '［': '(', '{': '(', '｛': '(', '【': '(', '<': '(', '《': '(',
    ')': ')', '）': ')', ']': ')', '］': ')', '}': ')', '｝': ')', '】': ')', '>': ')', '》': ')'
}
# Others (7)
_other = 'w y 0 6 7 8 9'.split()
other2ph_dict = {
'%': 'w',
'$': 'y',
'0': '0',
'6': '6',
'7': '7',
'8': '8',
'9': '9'
}
char2ph_dict = {**upper2ph_dict, **lower2ph_dict, **biao2ph_dict, **other2ph_dict}
if __name__ == "__main__":
print(__file__)
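    # Hedged example of how the tables compose: a toneful pinyin such as "hao3" is split
    # elsewhere (see sequence.py / split_pinyin) into ("hao", "3") and then looked up here.
    assert shengyun2ph_dict["hao"].split() == ["h", "ao"]
    assert diao2ph_dict["3"] == "3"
    assert char2ph_dict["A"] == "Aa" and char2ph_dict["a"] == "A"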

@ -0,0 +1,11 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/17
"""
#### pinyin
Methods for converting to pinyin: hanzi to pinyin and tone separation.
Pinyin is in letter+digit form, e.g. pin1.
"""
from ..pinyinkit import text2pinyin, split_pinyin
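if __name__ == "__main__":
    # Hedged usage sketch (run as `python -m phkit.chinese.pinyin`; assumes the pinyinkit
    # backends python-pinyin, jieba and phrase-pinyin-data are installed).
    print(text2pinyin("拼音"))    # expected letter+digit pinyin, e.g. ['pin1', 'yin1']
    print(split_pinyin("pin1"))   # expected to separate base and tone, e.g. ('pin', '1')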

@ -0,0 +1,153 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### sequence
Methods for converting to sequences: text to a phoneme list and text to an ID list.
Pinyin tone sandhi and pinyin-to-phoneme conversion.
"""
from .phoneme import shengyun2ph_dict, diao2ph_dict, char2ph_dict
from .pinyin import text2pinyin, split_pinyin
from .symbol import _chain, _eos, _pad, symbol_chinese
from .convert import fan2jian, quan2ban
from .number import convert_number
import re
# Regex for splitting out runs of English letters
_en_re = re.compile(r"([a-zA-Z]+)")
phs = ({w for p in shengyun2ph_dict.values() for w in p.split()}
| set(diao2ph_dict.values()) | set(char2ph_dict.values()))
assert bool(phs - set(symbol_chinese)) is False
ph2id_dict = {p: i for i, p in enumerate(symbol_chinese)}
id2ph_dict = {i: p for i, p in enumerate(symbol_chinese)}
assert len(ph2id_dict) == len(id2ph_dict)
def text2phoneme(text):
"""
    Convert text to phonemes using the Chinese phoneme scheme.
    Chinese is converted to pinyin and then to phonemes following the Tsinghua University scheme,
    split into initial, final and tone. All-uppercase English is read as letter names;
    other English is read as English letters. Punctuation is mapped to punctuation phonemes.
    :param text: str, normalized text.
    :return: list, phoneme list.
"""
text = normalize_chinese(text)
text = normalize_english(text)
pys = text2pinyin(text, errors=lambda x: (x,))
phs = pinyin2phoneme(pys)
phs = change_diao(phs)
return phs
def text2sequence(text):
"""
    Convert text to a sequence of phoneme IDs.
:param text:
:return:
"""
phs = text2phoneme(text)
seq = phoneme2sequence(phs)
return seq
def pinyin2phoneme(src):
"""
    Convert pinyin or other characters to phonemes.
    :param src: list; pinyin items are str, other characters are passed as tuple.
    :return: list.
"""
out = []
for py in src:
if type(py) is str:
fuyuan, diao = split_pinyin(py)
if fuyuan in shengyun2ph_dict and diao in diao2ph_dict:
phs = shengyun2ph_dict[fuyuan].split()
phs.append(diao2ph_dict[diao])
else:
phs = py_errors(py)
else:
phs = []
for w in py:
ph = py_errors(w)
phs.extend(ph)
if phs:
out.extend(phs)
out.append(_chain)
out.append(_eos)
out.append(_pad)
return out
def change_diao(src):
"""
    Tone sandhi: for consecutive third (shang) tones, change the earlier one to the second (rising) tone.
    :param src: list, phoneme list.
    :return: list, phoneme list after sandhi.
"""
flag = -5
out = []
for i, w in enumerate(reversed(src)):
if w == '3':
if i - flag == 4:
out.append('2')
else:
flag = i
out.append(w)
else:
out.append(w)
return list(reversed(out))
def phoneme2sequence(src):
out = []
for w in src:
if w in ph2id_dict:
out.append(ph2id_dict[w])
return out
def sequence2phoneme(src):
out = []
for w in src:
if w in id2ph_dict:
out.append(id2ph_dict[w])
return out
def py_errors(text):
out = []
for p in text:
if p in char2ph_dict:
out.append(char2ph_dict[p])
return out
def normalize_chinese(text):
text = quan2ban(text)
text = fan2jian(text)
text = convert_number(text)
return text
def normalize_english(text):
out = []
parts = _en_re.split(text)
for part in parts:
if not part.isupper():
out.append(part.lower())
else:
out.append(part)
return "".join(out)
if __name__ == "__main__":
print(__file__)
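    # Worked example of the tone-sandhi helper: per syllable the phonemes are
    # (initial, final, tone, chain), so two adjacent third tones sit 4 positions apart and
    # the earlier one is rewritten to tone 2. The ID round trip is lossless because every
    # token below is in symbol_chinese.
    demo_phs = ['n', 'i', '3', '-', 'h', 'ao', '3', '-', '~', '_']   # "ni3 hao3"
    assert change_diao(demo_phs) == ['n', 'i', '2', '-', 'h', 'ao', '3', '-', '~', '_']
    assert sequence2phoneme(phoneme2sequence(demo_phs)) == demo_phs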

@ -0,0 +1,339 @@
# author: kuangdd
# date: 2021/5/8
"""
#### style
Pinyin style conversion:
convert between GB-style pinyin (tone marks) and letter+digit style pinyin.
"""
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(Path(__file__).stem)
# 2100 = 420 * 5
guobiao2shengyundiao_dict = {
'a': 'a5', 'ā': 'a1', 'á': 'a2', 'ǎ': 'a3', 'à': 'a4', 'ai': 'ai5', 'āi': 'ai1', 'ái': 'ai2', 'ǎi': 'ai3',
'ài': 'ai4', 'an': 'an5', 'ān': 'an1', 'án': 'an2', 'ǎn': 'an3', 'àn': 'an4', 'ang': 'ang5', 'āng': 'ang1',
'áng': 'ang2', 'ǎng': 'ang3', 'àng': 'ang4', 'ao': 'ao5', 'āo': 'ao1', 'áo': 'ao2', 'ǎo': 'ao3', 'ào': 'ao4',
'ba': 'ba5', '': 'ba1', '': 'ba2', '': 'ba3', '': 'ba4', 'bai': 'bai5', 'bāi': 'bai1', 'bái': 'bai2',
'bǎi': 'bai3', 'bài': 'bai4', 'ban': 'ban5', 'bān': 'ban1', 'bán': 'ban2', 'bǎn': 'ban3', 'bàn': 'ban4',
'bang': 'bang5', 'bāng': 'bang1', 'báng': 'bang2', 'bǎng': 'bang3', 'bàng': 'bang4', 'bao': 'bao5', 'bāo': 'bao1',
'báo': 'bao2', 'bǎo': 'bao3', 'bào': 'bao4', 'bei': 'bei5', 'bēi': 'bei1', 'béi': 'bei2', 'běi': 'bei3',
'bèi': 'bei4', 'ben': 'ben5', 'bēn': 'ben1', 'bén': 'ben2', 'běn': 'ben3', 'bèn': 'ben4', 'beng': 'beng5',
'bēng': 'beng1', 'béng': 'beng2', 'běng': 'beng3', 'bèng': 'beng4', 'bi': 'bi5', '': 'bi1', '': 'bi2',
'': 'bi3', '': 'bi4', 'bian': 'bian5', 'biān': 'bian1', 'bián': 'bian2', 'biǎn': 'bian3', 'biàn': 'bian4',
'biao': 'biao5', 'biāo': 'biao1', 'biáo': 'biao2', 'biǎo': 'biao3', 'biào': 'biao4', 'bie': 'bie5', 'biē': 'bie1',
'bié': 'bie2', 'biě': 'bie3', 'biè': 'bie4', 'bin': 'bin5', 'bīn': 'bin1', 'bín': 'bin2', 'bǐn': 'bin3',
'bìn': 'bin4', 'bing': 'bing5', 'bīng': 'bing1', 'bíng': 'bing2', 'bǐng': 'bing3', 'bìng': 'bing4', 'bo': 'bo5',
'': 'bo1', '': 'bo2', '': 'bo3', '': 'bo4', 'bu': 'bu5', '': 'bu1', '': 'bu2', '': 'bu3', '': 'bu4',
'ca': 'ca5', '': 'ca1', '': 'ca2', '': 'ca3', '': 'ca4', 'cai': 'cai5', 'cāi': 'cai1', 'cái': 'cai2',
'cǎi': 'cai3', 'cài': 'cai4', 'can': 'can5', 'cān': 'can1', 'cán': 'can2', 'cǎn': 'can3', 'càn': 'can4',
'cang': 'cang5', 'cāng': 'cang1', 'cáng': 'cang2', 'cǎng': 'cang3', 'càng': 'cang4', 'cao': 'cao5', 'cāo': 'cao1',
'cáo': 'cao2', 'cǎo': 'cao3', 'cào': 'cao4', 'ce': 'ce5', '': 'ce1', '': 'ce2', '': 'ce3', '': 'ce4',
'cen': 'cen5', 'cēn': 'cen1', 'cén': 'cen2', 'cěn': 'cen3', 'cèn': 'cen4', 'ceng': 'ceng5', 'cēng': 'ceng1',
'céng': 'ceng2', 'cěng': 'ceng3', 'cèng': 'ceng4', 'cha': 'cha5', 'chā': 'cha1', 'chá': 'cha2', 'chǎ': 'cha3',
'chà': 'cha4', 'chai': 'chai5', 'chāi': 'chai1', 'chái': 'chai2', 'chǎi': 'chai3', 'chài': 'chai4', 'chan': 'chan5',
'chān': 'chan1', 'chán': 'chan2', 'chǎn': 'chan3', 'chàn': 'chan4', 'chang': 'chang5', 'chāng': 'chang1',
'cháng': 'chang2', 'chǎng': 'chang3', 'chàng': 'chang4', 'chao': 'chao5', 'chāo': 'chao1', 'cháo': 'chao2',
'chǎo': 'chao3', 'chào': 'chao4', 'che': 'che5', 'chē': 'che1', 'ché': 'che2', 'chě': 'che3', 'chè': 'che4',
'chen': 'chen5', 'chēn': 'chen1', 'chén': 'chen2', 'chěn': 'chen3', 'chèn': 'chen4', 'cheng': 'cheng5',
'chēng': 'cheng1', 'chéng': 'cheng2', 'chěng': 'cheng3', 'chèng': 'cheng4', 'chi': 'chi5', 'chī': 'chi1',
'chí': 'chi2', 'chǐ': 'chi3', 'chì': 'chi4', 'chong': 'chong5', 'chōng': 'chong1', 'chóng': 'chong2',
'chǒng': 'chong3', 'chòng': 'chong4', 'chou': 'chou5', 'chōu': 'chou1', 'chóu': 'chou2', 'chǒu': 'chou3',
'chòu': 'chou4', 'chu': 'chu5', 'chū': 'chu1', 'chú': 'chu2', 'chǔ': 'chu3', 'chù': 'chu4', 'chuai': 'chuai5',
'chuāi': 'chuai1', 'chuái': 'chuai2', 'chuǎi': 'chuai3', 'chuài': 'chuai4', 'chuan': 'chuan5', 'chuān': 'chuan1',
'chuán': 'chuan2', 'chuǎn': 'chuan3', 'chuàn': 'chuan4', 'chuang': 'chuang5', 'chuāng': 'chuang1',
'chuáng': 'chuang2', 'chuǎng': 'chuang3', 'chuàng': 'chuang4', 'chui': 'chui5', 'chuī': 'chui1', 'chuí': 'chui2',
'chuǐ': 'chui3', 'chuì': 'chui4', 'chun': 'chun5', 'chūn': 'chun1', 'chún': 'chun2', 'chǔn': 'chun3',
'chùn': 'chun4', 'chuo': 'chuo5', 'chuō': 'chuo1', 'chuó': 'chuo2', 'chuǒ': 'chuo3', 'chuò': 'chuo4', 'ci': 'ci5',
'': 'ci1', '': 'ci2', '': 'ci3', '': 'ci4', 'cong': 'cong5', 'cōng': 'cong1', 'cóng': 'cong2',
'cǒng': 'cong3', 'còng': 'cong4', 'cou': 'cou5', 'cōu': 'cou1', 'cóu': 'cou2', 'cǒu': 'cou3', 'còu': 'cou4',
'cu': 'cu5', '': 'cu1', '': 'cu2', '': 'cu3', '': 'cu4', 'cuan': 'cuan5', 'cuān': 'cuan1', 'cuán': 'cuan2',
'cuǎn': 'cuan3', 'cuàn': 'cuan4', 'cui': 'cui5', 'cuī': 'cui1', 'cuí': 'cui2', 'cuǐ': 'cui3', 'cuì': 'cui4',
'cun': 'cun5', 'cūn': 'cun1', 'cún': 'cun2', 'cǔn': 'cun3', 'cùn': 'cun4', 'cuo': 'cuo5', 'cuō': 'cuo1',
'cuó': 'cuo2', 'cuǒ': 'cuo3', 'cuò': 'cuo4', 'da': 'da5', '': 'da1', '': 'da2', '': 'da3', '': 'da4',
'dai': 'dai5', 'dāi': 'dai1', 'dái': 'dai2', 'dǎi': 'dai3', 'dài': 'dai4', 'dan': 'dan5', 'dān': 'dan1',
'dán': 'dan2', 'dǎn': 'dan3', 'dàn': 'dan4', 'dang': 'dang5', 'dāng': 'dang1', 'dáng': 'dang2', 'dǎng': 'dang3',
'dàng': 'dang4', 'dao': 'dao5', 'dāo': 'dao1', 'dáo': 'dao2', 'dǎo': 'dao3', 'dào': 'dao4', 'de': 'de5',
'': 'de1', '': 'de2', '': 'de3', '': 'de4', 'dei': 'dei5', 'dēi': 'dei1', 'déi': 'dei2', 'děi': 'dei3',
'dèi': 'dei4', 'den': 'den5', 'dēn': 'den1', 'dén': 'den2', 'děn': 'den3', 'dèn': 'den4', 'deng': 'deng5',
'dēng': 'deng1', 'déng': 'deng2', 'děng': 'deng3', 'dèng': 'deng4', 'di': 'di5', '': 'di1', '': 'di2',
'': 'di3', '': 'di4', 'dia': 'dia5', 'diā': 'dia1', 'diá': 'dia2', 'diǎ': 'dia3', 'dià': 'dia4',
'dian': 'dian5', 'diān': 'dian1', 'dián': 'dian2', 'diǎn': 'dian3', 'diàn': 'dian4', 'diao': 'diao5',
'diāo': 'diao1', 'diáo': 'diao2', 'diǎo': 'diao3', 'diào': 'diao4', 'die': 'die5', 'diē': 'die1', 'dié': 'die2',
'diě': 'die3', 'diè': 'die4', 'ding': 'ding5', 'dīng': 'ding1', 'díng': 'ding2', 'dǐng': 'ding3', 'dìng': 'ding4',
'diu': 'diu5', 'diū': 'diu1', 'diú': 'diu2', 'diǔ': 'diu3', 'diù': 'diu4', 'dong': 'dong5', 'dōng': 'dong1',
'dóng': 'dong2', 'dǒng': 'dong3', 'dòng': 'dong4', 'dou': 'dou5', 'dōu': 'dou1', 'dóu': 'dou2', 'dǒu': 'dou3',
'dòu': 'dou4', 'du': 'du5', '': 'du1', '': 'du2', '': 'du3', '': 'du4', 'duan': 'duan5', 'duān': 'duan1',
'duán': 'duan2', 'duǎn': 'duan3', 'duàn': 'duan4', 'dui': 'dui5', 'duī': 'dui1', 'duí': 'dui2', 'duǐ': 'dui3',
'duì': 'dui4', 'dun': 'dun5', 'dūn': 'dun1', 'dún': 'dun2', 'dǔn': 'dun3', 'dùn': 'dun4', 'duo': 'duo5',
'duō': 'duo1', 'duó': 'duo2', 'duǒ': 'duo3', 'duò': 'duo4', 'e': 'e5', 'ē': 'e1', 'é': 'e2', 'ě': 'e3', 'è': 'e4',
'ei': 'ei5', 'ēi': 'ei1', 'éi': 'ei2', 'ěi': 'ei3', 'èi': 'ei4', 'en': 'en5', 'ēn': 'en1', 'én': 'en2', 'ěn': 'en3',
'èn': 'en4', 'eng': 'eng5', 'ēng': 'eng1', 'éng': 'eng2', 'ěng': 'eng3', 'èng': 'eng4', 'er': 'er5', 'ēr': 'er1',
'ér': 'er2', 'ěr': 'er3', 'èr': 'er4', 'fa': 'fa5', '': 'fa1', '': 'fa2', '': 'fa3', '': 'fa4',
'fan': 'fan5', 'fān': 'fan1', 'fán': 'fan2', 'fǎn': 'fan3', 'fàn': 'fan4', 'fang': 'fang5', 'fāng': 'fang1',
'fáng': 'fang2', 'fǎng': 'fang3', 'fàng': 'fang4', 'fei': 'fei5', 'fēi': 'fei1', 'féi': 'fei2', 'fěi': 'fei3',
'fèi': 'fei4', 'fen': 'fen5', 'fēn': 'fen1', 'fén': 'fen2', 'fěn': 'fen3', 'fèn': 'fen4', 'feng': 'feng5',
'fēng': 'feng1', 'féng': 'feng2', 'fěng': 'feng3', 'fèng': 'feng4', 'fo': 'fo5', '': 'fo1', '': 'fo2',
'': 'fo3', '': 'fo4', 'fou': 'fou5', 'fōu': 'fou1', 'fóu': 'fou2', 'fǒu': 'fou3', 'fòu': 'fou4', 'fu': 'fu5',
'': 'fu1', '': 'fu2', '': 'fu3', '': 'fu4', 'ga': 'ga5', '': 'ga1', '': 'ga2', '': 'ga3', '': 'ga4',
'gai': 'gai5', 'gāi': 'gai1', 'gái': 'gai2', 'gǎi': 'gai3', 'gài': 'gai4', 'gan': 'gan5', 'gān': 'gan1',
'gán': 'gan2', 'gǎn': 'gan3', 'gàn': 'gan4', 'gang': 'gang5', 'gāng': 'gang1', 'gáng': 'gang2', 'gǎng': 'gang3',
'gàng': 'gang4', 'gao': 'gao5', 'gāo': 'gao1', 'gáo': 'gao2', 'gǎo': 'gao3', 'gào': 'gao4', 'ge': 'ge5',
'': 'ge1', '': 'ge2', '': 'ge3', '': 'ge4', 'gei': 'gei5', 'gēi': 'gei1', 'géi': 'gei2', 'gěi': 'gei3',
'gèi': 'gei4', 'gen': 'gen5', 'gēn': 'gen1', 'gén': 'gen2', 'gěn': 'gen3', 'gèn': 'gen4', 'geng': 'geng5',
'gēng': 'geng1', 'géng': 'geng2', 'gěng': 'geng3', 'gèng': 'geng4', 'gong': 'gong5', 'gōng': 'gong1',
'góng': 'gong2', 'gǒng': 'gong3', 'gòng': 'gong4', 'gou': 'gou5', 'gōu': 'gou1', 'góu': 'gou2', 'gǒu': 'gou3',
'gòu': 'gou4', 'gu': 'gu5', '': 'gu1', '': 'gu2', '': 'gu3', '': 'gu4', 'gua': 'gua5', 'guā': 'gua1',
'guá': 'gua2', 'guǎ': 'gua3', 'guà': 'gua4', 'guai': 'guai5', 'guāi': 'guai1', 'guái': 'guai2', 'guǎi': 'guai3',
'guài': 'guai4', 'guan': 'guan5', 'guān': 'guan1', 'guán': 'guan2', 'guǎn': 'guan3', 'guàn': 'guan4',
'guang': 'guang5', 'guāng': 'guang1', 'guáng': 'guang2', 'guǎng': 'guang3', 'guàng': 'guang4', 'gui': 'gui5',
'guī': 'gui1', 'guí': 'gui2', 'guǐ': 'gui3', 'guì': 'gui4', 'gun': 'gun5', 'gūn': 'gun1', 'gún': 'gun2',
'gǔn': 'gun3', 'gùn': 'gun4', 'guo': 'guo5', 'guō': 'guo1', 'guó': 'guo2', 'guǒ': 'guo3', 'guò': 'guo4',
'ha': 'ha5', '': 'ha1', '': 'ha2', '': 'ha3', '': 'ha4', 'hai': 'hai5', 'hāi': 'hai1', 'hái': 'hai2',
'hǎi': 'hai3', 'hài': 'hai4', 'han': 'han5', 'hān': 'han1', 'hán': 'han2', 'hǎn': 'han3', 'hàn': 'han4',
'hang': 'hang5', 'hāng': 'hang1', 'háng': 'hang2', 'hǎng': 'hang3', 'hàng': 'hang4', 'hao': 'hao5', 'hāo': 'hao1',
'háo': 'hao2', 'hǎo': 'hao3', 'hào': 'hao4', 'he': 'he5', '': 'he1', '': 'he2', '': 'he3', '': 'he4',
'hei': 'hei5', 'hēi': 'hei1', 'héi': 'hei2', 'hěi': 'hei3', 'hèi': 'hei4', 'hen': 'hen5', 'hēn': 'hen1',
'hén': 'hen2', 'hěn': 'hen3', 'hèn': 'hen4', 'heng': 'heng5', 'hēng': 'heng1', 'héng': 'heng2', 'hěng': 'heng3',
'hèng': 'heng4', 'hong': 'hong5', 'hōng': 'hong1', 'hóng': 'hong2', 'hǒng': 'hong3', 'hòng': 'hong4', 'hou': 'hou5',
'hōu': 'hou1', 'hóu': 'hou2', 'hǒu': 'hou3', 'hòu': 'hou4', 'hu': 'hu5', '': 'hu1', '': 'hu2', '': 'hu3',
'': 'hu4', 'hua': 'hua5', 'huā': 'hua1', 'huá': 'hua2', 'huǎ': 'hua3', 'huà': 'hua4', 'huai': 'huai5',
'huāi': 'huai1', 'huái': 'huai2', 'huǎi': 'huai3', 'huài': 'huai4', 'huan': 'huan5', 'huān': 'huan1',
'huán': 'huan2', 'huǎn': 'huan3', 'huàn': 'huan4', 'huang': 'huang5', 'huāng': 'huang1', 'huáng': 'huang2',
'huǎng': 'huang3', 'huàng': 'huang4', 'hui': 'hui5', 'huī': 'hui1', 'huí': 'hui2', 'huǐ': 'hui3', 'huì': 'hui4',
'hun': 'hun5', 'hūn': 'hun1', 'hún': 'hun2', 'hǔn': 'hun3', 'hùn': 'hun4', 'huo': 'huo5', 'huō': 'huo1',
'huó': 'huo2', 'huǒ': 'huo3', 'huò': 'huo4', 'ji': 'ji5', '': 'ji1', '': 'ji2', '': 'ji3', '': 'ji4',
'jia': 'jia5', 'jiā': 'jia1', 'jiá': 'jia2', 'jiǎ': 'jia3', 'jià': 'jia4', 'jian': 'jian5', 'jiān': 'jian1',
'jián': 'jian2', 'jiǎn': 'jian3', 'jiàn': 'jian4', 'jiang': 'jiang5', 'jiāng': 'jiang1', 'jiáng': 'jiang2',
'jiǎng': 'jiang3', 'jiàng': 'jiang4', 'jiao': 'jiao5', 'jiāo': 'jiao1', 'jiáo': 'jiao2', 'jiǎo': 'jiao3',
'jiào': 'jiao4', 'jie': 'jie5', 'jiē': 'jie1', 'jié': 'jie2', 'jiě': 'jie3', 'jiè': 'jie4', 'jin': 'jin5',
'jīn': 'jin1', 'jín': 'jin2', 'jǐn': 'jin3', 'jìn': 'jin4', 'jing': 'jing5', 'jīng': 'jing1', 'jíng': 'jing2',
'jǐng': 'jing3', 'jìng': 'jing4', 'jiong': 'jiong5', 'jiōng': 'jiong1', 'jióng': 'jiong2', 'jiǒng': 'jiong3',
'jiòng': 'jiong4', 'jiu': 'jiu5', 'jiū': 'jiu1', 'jiú': 'jiu2', 'jiǔ': 'jiu3', 'jiù': 'jiu4', 'ju': 'ju5',
'': 'ju1', '': 'ju2', '': 'ju3', '': 'ju4', 'juan': 'juan5', 'juān': 'juan1', 'juán': 'juan2',
'juǎn': 'juan3', 'juàn': 'juan4', 'jue': 'jue5', 'juē': 'jue1', 'jué': 'jue2', 'juě': 'jue3', 'juè': 'jue4',
'jun': 'jun5', 'jūn': 'jun1', 'jún': 'jun2', 'jǔn': 'jun3', 'jùn': 'jun4', 'ka': 'ka5', '': 'ka1', '': 'ka2',
'': 'ka3', '': 'ka4', 'kai': 'kai5', 'kāi': 'kai1', 'kái': 'kai2', 'kǎi': 'kai3', 'kài': 'kai4', 'kan': 'kan5',
'kān': 'kan1', 'kán': 'kan2', 'kǎn': 'kan3', 'kàn': 'kan4', 'kang': 'kang5', 'kāng': 'kang1', 'káng': 'kang2',
'kǎng': 'kang3', 'kàng': 'kang4', 'kao': 'kao5', 'kāo': 'kao1', 'káo': 'kao2', 'kǎo': 'kao3', 'kào': 'kao4',
'ke': 'ke5', '': 'ke1', '': 'ke2', '': 'ke3', '': 'ke4', 'ken': 'ken5', 'kēn': 'ken1', 'kén': 'ken2',
'kěn': 'ken3', 'kèn': 'ken4', 'keng': 'keng5', 'kēng': 'keng1', 'kéng': 'keng2', 'kěng': 'keng3', 'kèng': 'keng4',
'kong': 'kong5', 'kōng': 'kong1', 'kóng': 'kong2', 'kǒng': 'kong3', 'kòng': 'kong4', 'kou': 'kou5', 'kōu': 'kou1',
'kóu': 'kou2', 'kǒu': 'kou3', 'kòu': 'kou4', 'ku': 'ku5', '': 'ku1', '': 'ku2', '': 'ku3', '': 'ku4',
'kua': 'kua5', 'kuā': 'kua1', 'kuá': 'kua2', 'kuǎ': 'kua3', 'kuà': 'kua4', 'kuai': 'kuai5', 'kuāi': 'kuai1',
'kuái': 'kuai2', 'kuǎi': 'kuai3', 'kuài': 'kuai4', 'kuan': 'kuan5', 'kuān': 'kuan1', 'kuán': 'kuan2',
'kuǎn': 'kuan3', 'kuàn': 'kuan4', 'kuang': 'kuang5', 'kuāng': 'kuang1', 'kuáng': 'kuang2', 'kuǎng': 'kuang3',
'kuàng': 'kuang4', 'kui': 'kui5', 'kuī': 'kui1', 'kuí': 'kui2', 'kuǐ': 'kui3', 'kuì': 'kui4', 'kun': 'kun5',
'kūn': 'kun1', 'kún': 'kun2', 'kǔn': 'kun3', 'kùn': 'kun4', 'kuo': 'kuo5', 'kuō': 'kuo1', 'kuó': 'kuo2',
'kuǒ': 'kuo3', 'kuò': 'kuo4', 'la': 'la5', '': 'la1', '': 'la2', '': 'la3', '': 'la4', 'lai': 'lai5',
'lāi': 'lai1', 'lái': 'lai2', 'lǎi': 'lai3', 'lài': 'lai4', 'lan': 'lan5', 'lān': 'lan1', 'lán': 'lan2',
'lǎn': 'lan3', 'làn': 'lan4', 'lang': 'lang5', 'lāng': 'lang1', 'láng': 'lang2', 'lǎng': 'lang3', 'làng': 'lang4',
'lao': 'lao5', 'lāo': 'lao1', 'láo': 'lao2', 'lǎo': 'lao3', 'lào': 'lao4', 'le': 'le5', '': 'le1', '': 'le2',
'': 'le3', '': 'le4', 'lei': 'lei5', 'lēi': 'lei1', 'léi': 'lei2', 'lěi': 'lei3', 'lèi': 'lei4',
'leng': 'leng5', 'lēng': 'leng1', 'léng': 'leng2', 'lěng': 'leng3', 'lèng': 'leng4', 'li': 'li5', '': 'li1',
'': 'li2', '': 'li3', '': 'li4', 'lia': 'lia5', 'liā': 'lia1', 'liá': 'lia2', 'liǎ': 'lia3', 'lià': 'lia4',
'lian': 'lian5', 'liān': 'lian1', 'lián': 'lian2', 'liǎn': 'lian3', 'liàn': 'lian4', 'liang': 'liang5',
'liāng': 'liang1', 'liáng': 'liang2', 'liǎng': 'liang3', 'liàng': 'liang4', 'liao': 'liao5', 'liāo': 'liao1',
'liáo': 'liao2', 'liǎo': 'liao3', 'liào': 'liao4', 'lie': 'lie5', 'liē': 'lie1', 'lié': 'lie2', 'liě': 'lie3',
'liè': 'lie4', 'lin': 'lin5', 'līn': 'lin1', 'lín': 'lin2', 'lǐn': 'lin3', 'lìn': 'lin4', 'ling': 'ling5',
'līng': 'ling1', 'líng': 'ling2', 'lǐng': 'ling3', 'lìng': 'ling4', 'liu': 'liu5', 'liū': 'liu1', 'liú': 'liu2',
'liǔ': 'liu3', 'liù': 'liu4', 'lo': 'lo5', '': 'lo1', '': 'lo2', '': 'lo3', '': 'lo4', 'long': 'long5',
'lōng': 'long1', 'lóng': 'long2', 'lǒng': 'long3', 'lòng': 'long4', 'lou': 'lou5', 'lōu': 'lou1', 'lóu': 'lou2',
'lǒu': 'lou3', 'lòu': 'lou4', 'lu': 'lu5', '': 'lu1', '': 'lu2', '': 'lu3', '': 'lu4', 'luan': 'luan5',
'luān': 'luan1', 'luán': 'luan2', 'luǎn': 'luan3', 'luàn': 'luan4', 'lun': 'lun5', 'lūn': 'lun1', 'lún': 'lun2',
'lǔn': 'lun3', 'lùn': 'lun4', 'luo': 'luo5', 'luō': 'luo1', 'luó': 'luo2', 'luǒ': 'luo3', 'luò': 'luo4',
'': 'lv5', '': 'lv1', '': 'lv2', '': 'lv3', '': 'lv4', 'lüe': 'lve5', 'lüē': 'lve1', 'lüé': 'lve2',
'lüě': 'lve3', 'lüè': 'lve4', 'ma': 'ma5', '': 'ma1', '': 'ma2', '': 'ma3', '': 'ma4', 'mai': 'mai5',
'māi': 'mai1', 'mái': 'mai2', 'mǎi': 'mai3', 'mài': 'mai4', 'man': 'man5', 'mān': 'man1', 'mán': 'man2',
'mǎn': 'man3', 'màn': 'man4', 'mang': 'mang5', 'māng': 'mang1', 'máng': 'mang2', 'mǎng': 'mang3', 'màng': 'mang4',
'mao': 'mao5', 'māo': 'mao1', 'máo': 'mao2', 'mǎo': 'mao3', 'mào': 'mao4', 'me': 'me5', '': 'me1', '': 'me2',
'': 'me3', '': 'me4', 'mei': 'mei5', 'mēi': 'mei1', 'méi': 'mei2', 'měi': 'mei3', 'mèi': 'mei4', 'men': 'men5',
'mēn': 'men1', 'mén': 'men2', 'měn': 'men3', 'mèn': 'men4', 'meng': 'meng5', 'mēng': 'meng1', 'méng': 'meng2',
'měng': 'meng3', 'mèng': 'meng4', 'mi': 'mi5', '': 'mi1', '': 'mi2', '': 'mi3', '': 'mi4', 'mian': 'mian5',
'miān': 'mian1', 'mián': 'mian2', 'miǎn': 'mian3', 'miàn': 'mian4', 'miao': 'miao5', 'miāo': 'miao1',
'miáo': 'miao2', 'miǎo': 'miao3', 'miào': 'miao4', 'mie': 'mie5', 'miē': 'mie1', 'mié': 'mie2', 'miě': 'mie3',
'miè': 'mie4', 'min': 'min5', 'mīn': 'min1', 'mín': 'min2', 'mǐn': 'min3', 'mìn': 'min4', 'ming': 'ming5',
'mīng': 'ming1', 'míng': 'ming2', 'mǐng': 'ming3', 'mìng': 'ming4', 'miu': 'miu5', 'miū': 'miu1', 'miú': 'miu2',
'miǔ': 'miu3', 'miù': 'miu4', 'mo': 'mo5', '': 'mo1', '': 'mo2', '': 'mo3', '': 'mo4', 'mou': 'mou5',
'mōu': 'mou1', 'móu': 'mou2', 'mǒu': 'mou3', 'mòu': 'mou4', 'mu': 'mu5', '': 'mu1', '': 'mu2', '': 'mu3',
'': 'mu4', 'na': 'na5', '': 'na1', '': 'na2', '': 'na3', '': 'na4', 'nai': 'nai5', 'nāi': 'nai1',
'nái': 'nai2', 'nǎi': 'nai3', 'nài': 'nai4', 'nan': 'nan5', 'nān': 'nan1', 'nán': 'nan2', 'nǎn': 'nan3',
'nàn': 'nan4', 'nang': 'nang5', 'nāng': 'nang1', 'náng': 'nang2', 'nǎng': 'nang3', 'nàng': 'nang4', 'nao': 'nao5',
'nāo': 'nao1', 'náo': 'nao2', 'nǎo': 'nao3', 'nào': 'nao4', 'ne': 'ne5', '': 'ne1', '': 'ne2', '': 'ne3',
'': 'ne4', 'nei': 'nei5', 'nēi': 'nei1', 'néi': 'nei2', 'něi': 'nei3', 'nèi': 'nei4', 'nen': 'nen5',
'nēn': 'nen1', 'nén': 'nen2', 'něn': 'nen3', 'nèn': 'nen4', 'neng': 'neng5', 'nēng': 'neng1', 'néng': 'neng2',
'něng': 'neng3', 'nèng': 'neng4', 'ni': 'ni5', '': 'ni1', '': 'ni2', '': 'ni3', '': 'ni4', 'nian': 'nian5',
'niān': 'nian1', 'nián': 'nian2', 'niǎn': 'nian3', 'niàn': 'nian4', 'niang': 'niang5', 'niāng': 'niang1',
'niáng': 'niang2', 'niǎng': 'niang3', 'niàng': 'niang4', 'niao': 'niao5', 'niāo': 'niao1', 'niáo': 'niao2',
'niǎo': 'niao3', 'niào': 'niao4', 'nie': 'nie5', 'niē': 'nie1', 'nié': 'nie2', 'niě': 'nie3', 'niè': 'nie4',
'nin': 'nin5', 'nīn': 'nin1', 'nín': 'nin2', 'nǐn': 'nin3', 'nìn': 'nin4', 'ning': 'ning5', 'nīng': 'ning1',
'níng': 'ning2', 'nǐng': 'ning3', 'nìng': 'ning4', 'niu': 'niu5', 'niū': 'niu1', 'niú': 'niu2', 'niǔ': 'niu3',
'niù': 'niu4', 'nong': 'nong5', 'nōng': 'nong1', 'nóng': 'nong2', 'nǒng': 'nong3', 'nòng': 'nong4', 'nou': 'nou5',
'nōu': 'nou1', 'nóu': 'nou2', 'nǒu': 'nou3', 'nòu': 'nou4', 'nu': 'nu5', '': 'nu1', '': 'nu2', '': 'nu3',
'': 'nu4', 'nuan': 'nuan5', 'nuān': 'nuan1', 'nuán': 'nuan2', 'nuǎn': 'nuan3', 'nuàn': 'nuan4', 'nuo': 'nuo5',
'nuō': 'nuo1', 'nuó': 'nuo2', 'nuǒ': 'nuo3', 'nuò': 'nuo4', '': 'nv5', '': 'nv1', '': 'nv2', '': 'nv3',
'': 'nv4', 'nüe': 'nve5', 'nüē': 'nve1', 'nüé': 'nve2', 'nüě': 'nve3', 'nüè': 'nve4', 'o': 'o5', 'ō': 'o1',
'ó': 'o2', 'ǒ': 'o3', 'ò': 'o4', 'ou': 'ou5', 'ōu': 'ou1', 'óu': 'ou2', 'ǒu': 'ou3', 'òu': 'ou4', 'pa': 'pa5',
'': 'pa1', '': 'pa2', '': 'pa3', '': 'pa4', 'pai': 'pai5', 'pāi': 'pai1', 'pái': 'pai2', 'pǎi': 'pai3',
'pài': 'pai4', 'pan': 'pan5', 'pān': 'pan1', 'pán': 'pan2', 'pǎn': 'pan3', 'pàn': 'pan4', 'pang': 'pang5',
'pāng': 'pang1', 'páng': 'pang2', 'pǎng': 'pang3', 'pàng': 'pang4', 'pao': 'pao5', 'pāo': 'pao1', 'páo': 'pao2',
'pǎo': 'pao3', 'pào': 'pao4', 'pei': 'pei5', 'pēi': 'pei1', 'péi': 'pei2', 'pěi': 'pei3', 'pèi': 'pei4',
'pen': 'pen5', 'pēn': 'pen1', 'pén': 'pen2', 'pěn': 'pen3', 'pèn': 'pen4', 'peng': 'peng5', 'pēng': 'peng1',
'péng': 'peng2', 'pěng': 'peng3', 'pèng': 'peng4', 'pi': 'pi5', '': 'pi1', '': 'pi2', '': 'pi3', '': 'pi4',
'pian': 'pian5', 'piān': 'pian1', 'pián': 'pian2', 'piǎn': 'pian3', 'piàn': 'pian4', 'piao': 'piao5',
'piāo': 'piao1', 'piáo': 'piao2', 'piǎo': 'piao3', 'piào': 'piao4', 'pie': 'pie5', 'piē': 'pie1', 'pié': 'pie2',
'piě': 'pie3', 'piè': 'pie4', 'pin': 'pin5', 'pīn': 'pin1', 'pín': 'pin2', 'pǐn': 'pin3', 'pìn': 'pin4',
'ping': 'ping5', 'pīng': 'ping1', 'píng': 'ping2', 'pǐng': 'ping3', 'pìng': 'ping4', 'po': 'po5', '': 'po1',
'': 'po2', '': 'po3', '': 'po4', 'pou': 'pou5', 'pōu': 'pou1', 'póu': 'pou2', 'pǒu': 'pou3', 'pòu': 'pou4',
'pu': 'pu5', '': 'pu1', '': 'pu2', '': 'pu3', '': 'pu4', 'qi': 'qi5', '': 'qi1', '': 'qi2', '': 'qi3',
'': 'qi4', 'qia': 'qia5', 'qiā': 'qia1', 'qiá': 'qia2', 'qiǎ': 'qia3', 'qià': 'qia4', 'qian': 'qian5',
'qiān': 'qian1', 'qián': 'qian2', 'qiǎn': 'qian3', 'qiàn': 'qian4', 'qiang': 'qiang5', 'qiāng': 'qiang1',
'qiáng': 'qiang2', 'qiǎng': 'qiang3', 'qiàng': 'qiang4', 'qiao': 'qiao5', 'qiāo': 'qiao1', 'qiáo': 'qiao2',
'qiǎo': 'qiao3', 'qiào': 'qiao4', 'qie': 'qie5', 'qiē': 'qie1', 'qié': 'qie2', 'qiě': 'qie3', 'qiè': 'qie4',
'qin': 'qin5', 'qīn': 'qin1', 'qín': 'qin2', 'qǐn': 'qin3', 'qìn': 'qin4', 'qing': 'qing5', 'qīng': 'qing1',
'qíng': 'qing2', 'qǐng': 'qing3', 'qìng': 'qing4', 'qiong': 'qiong5', 'qiōng': 'qiong1', 'qióng': 'qiong2',
'qiǒng': 'qiong3', 'qiòng': 'qiong4', 'qiu': 'qiu5', 'qiū': 'qiu1', 'qiú': 'qiu2', 'qiǔ': 'qiu3', 'qiù': 'qiu4',
'qu': 'qu5', '': 'qu1', '': 'qu2', '': 'qu3', '': 'qu4', 'quan': 'quan5', 'quān': 'quan1', 'quán': 'quan2',
'quǎn': 'quan3', 'quàn': 'quan4', 'que': 'que5', 'quē': 'que1', 'qué': 'que2', 'quě': 'que3', 'què': 'que4',
'qun': 'qun5', 'qūn': 'qun1', 'qún': 'qun2', 'qǔn': 'qun3', 'qùn': 'qun4', 'ran': 'ran5', 'rān': 'ran1',
'rán': 'ran2', 'rǎn': 'ran3', 'ràn': 'ran4', 'rang': 'rang5', 'rāng': 'rang1', 'ráng': 'rang2', 'rǎng': 'rang3',
'ràng': 'rang4', 'rao': 'rao5', 'rāo': 'rao1', 'ráo': 'rao2', 'rǎo': 'rao3', 'rào': 'rao4', 're': 're5',
'': 're1', '': 're2', '': 're3', '': 're4', 'ren': 'ren5', 'rēn': 'ren1', 'rén': 'ren2', 'rěn': 'ren3',
'rèn': 'ren4', 'reng': 'reng5', 'rēng': 'reng1', 'réng': 'reng2', 'rěng': 'reng3', 'rèng': 'reng4', 'ri': 'ri5',
'': 'ri1', '': 'ri2', '': 'ri3', '': 'ri4', 'rong': 'rong5', 'rōng': 'rong1', 'róng': 'rong2',
'rǒng': 'rong3', 'ròng': 'rong4', 'rou': 'rou5', 'rōu': 'rou1', 'róu': 'rou2', 'rǒu': 'rou3', 'ròu': 'rou4',
'ru': 'ru5', '': 'ru1', '': 'ru2', '': 'ru3', '': 'ru4', 'ruan': 'ruan5', 'ruān': 'ruan1', 'ruán': 'ruan2',
'ruǎn': 'ruan3', 'ruàn': 'ruan4', 'rui': 'rui5', 'ruī': 'rui1', 'ruí': 'rui2', 'ruǐ': 'rui3', 'ruì': 'rui4',
'run': 'run5', 'rūn': 'run1', 'rún': 'run2', 'rǔn': 'run3', 'rùn': 'run4', 'ruo': 'ruo5', 'ruō': 'ruo1',
'ruó': 'ruo2', 'ruǒ': 'ruo3', 'ruò': 'ruo4', 'sa': 'sa5', '': 'sa1', '': 'sa2', '': 'sa3', '': 'sa4',
'sai': 'sai5', 'sāi': 'sai1', 'sái': 'sai2', 'sǎi': 'sai3', 'sài': 'sai4', 'san': 'san5', 'sān': 'san1',
'sán': 'san2', 'sǎn': 'san3', 'sàn': 'san4', 'sang': 'sang5', 'sāng': 'sang1', 'sáng': 'sang2', 'sǎng': 'sang3',
'sàng': 'sang4', 'sao': 'sao5', 'sāo': 'sao1', 'sáo': 'sao2', 'sǎo': 'sao3', 'sào': 'sao4', 'se': 'se5',
'': 'se1', '': 'se2', '': 'se3', '': 'se4', 'sen': 'sen5', 'sēn': 'sen1', 'sén': 'sen2', 'sěn': 'sen3',
'sèn': 'sen4', 'seng': 'seng5', 'sēng': 'seng1', 'séng': 'seng2', 'sěng': 'seng3', 'sèng': 'seng4', 'sha': 'sha5',
'shā': 'sha1', 'shá': 'sha2', 'shǎ': 'sha3', 'shà': 'sha4', 'shai': 'shai5', 'shāi': 'shai1', 'shái': 'shai2',
'shǎi': 'shai3', 'shài': 'shai4', 'shan': 'shan5', 'shān': 'shan1', 'shán': 'shan2', 'shǎn': 'shan3',
'shàn': 'shan4', 'shang': 'shang5', 'shāng': 'shang1', 'sháng': 'shang2', 'shǎng': 'shang3', 'shàng': 'shang4',
'shao': 'shao5', 'shāo': 'shao1', 'sháo': 'shao2', 'shǎo': 'shao3', 'shào': 'shao4', 'she': 'she5', 'shē': 'she1',
'shé': 'she2', 'shě': 'she3', 'shè': 'she4', 'shei': 'shei5', 'shēi': 'shei1', 'shéi': 'shei2', 'shěi': 'shei3',
'shèi': 'shei4', 'shen': 'shen5', 'shēn': 'shen1', 'shén': 'shen2', 'shěn': 'shen3', 'shèn': 'shen4',
'sheng': 'sheng5', 'shēng': 'sheng1', 'shéng': 'sheng2', 'shěng': 'sheng3', 'shèng': 'sheng4', 'shi': 'shi5',
'shī': 'shi1', 'shí': 'shi2', 'shǐ': 'shi3', 'shì': 'shi4', 'shou': 'shou5', 'shōu': 'shou1', 'shóu': 'shou2',
'shǒu': 'shou3', 'shòu': 'shou4', 'shu': 'shu5', 'shū': 'shu1', 'shú': 'shu2', 'shǔ': 'shu3', 'shù': 'shu4',
'shua': 'shua5', 'shuā': 'shua1', 'shuá': 'shua2', 'shuǎ': 'shua3', 'shuà': 'shua4', 'shuai': 'shuai5',
'shuāi': 'shuai1', 'shuái': 'shuai2', 'shuǎi': 'shuai3', 'shuài': 'shuai4', 'shuan': 'shuan5', 'shuān': 'shuan1',
'shuán': 'shuan2', 'shuǎn': 'shuan3', 'shuàn': 'shuan4', 'shuang': 'shuang5', 'shuāng': 'shuang1',
'shuáng': 'shuang2', 'shuǎng': 'shuang3', 'shuàng': 'shuang4', 'shui': 'shui5', 'shuī': 'shui1', 'shuí': 'shui2',
'shuǐ': 'shui3', 'shuì': 'shui4', 'shun': 'shun5', 'shūn': 'shun1', 'shún': 'shun2', 'shǔn': 'shun3',
'shùn': 'shun4', 'shuo': 'shuo5', 'shuō': 'shuo1', 'shuó': 'shuo2', 'shuǒ': 'shuo3', 'shuò': 'shuo4', 'si': 'si5',
'': 'si1', '': 'si2', '': 'si3', '': 'si4', 'song': 'song5', 'sōng': 'song1', 'sóng': 'song2',
'sǒng': 'song3', 'sòng': 'song4', 'sou': 'sou5', 'sōu': 'sou1', 'sóu': 'sou2', 'sǒu': 'sou3', 'sòu': 'sou4',
'su': 'su5', '': 'su1', '': 'su2', '': 'su3', '': 'su4', 'suan': 'suan5', 'suān': 'suan1', 'suán': 'suan2',
'suǎn': 'suan3', 'suàn': 'suan4', 'sui': 'sui5', 'suī': 'sui1', 'suí': 'sui2', 'suǐ': 'sui3', 'suì': 'sui4',
'sun': 'sun5', 'sūn': 'sun1', 'sún': 'sun2', 'sǔn': 'sun3', 'sùn': 'sun4', 'suo': 'suo5', 'suō': 'suo1',
'suó': 'suo2', 'suǒ': 'suo3', 'suò': 'suo4', 'ta': 'ta5', '': 'ta1', '': 'ta2', '': 'ta3', '': 'ta4',
'tai': 'tai5', 'tāi': 'tai1', 'tái': 'tai2', 'tǎi': 'tai3', 'tài': 'tai4', 'tan': 'tan5', 'tān': 'tan1',
'tán': 'tan2', 'tǎn': 'tan3', 'tàn': 'tan4', 'tang': 'tang5', 'tāng': 'tang1', 'táng': 'tang2', 'tǎng': 'tang3',
'tàng': 'tang4', 'tao': 'tao5', 'tāo': 'tao1', 'táo': 'tao2', 'tǎo': 'tao3', 'tào': 'tao4', 'te': 'te5',
'': 'te1', '': 'te2', '': 'te3', '': 'te4', 'teng': 'teng5', 'tēng': 'teng1', 'téng': 'teng2',
'těng': 'teng3', 'tèng': 'teng4', 'ti': 'ti5', '': 'ti1', '': 'ti2', '': 'ti3', '': 'ti4', 'tian': 'tian5',
'tiān': 'tian1', 'tián': 'tian2', 'tiǎn': 'tian3', 'tiàn': 'tian4', 'tiao': 'tiao5', 'tiāo': 'tiao1',
'tiáo': 'tiao2', 'tiǎo': 'tiao3', 'tiào': 'tiao4', 'tie': 'tie5', 'tiē': 'tie1', 'tié': 'tie2', 'tiě': 'tie3',
'tiè': 'tie4', 'ting': 'ting5', 'tīng': 'ting1', 'tíng': 'ting2', 'tǐng': 'ting3', 'tìng': 'ting4', 'tong': 'tong5',
'tōng': 'tong1', 'tóng': 'tong2', 'tǒng': 'tong3', 'tòng': 'tong4', 'tou': 'tou5', 'tōu': 'tou1', 'tóu': 'tou2',
'tǒu': 'tou3', 'tòu': 'tou4', 'tu': 'tu5', '': 'tu1', '': 'tu2', '': 'tu3', '': 'tu4', 'tuan': 'tuan5',
'tuān': 'tuan1', 'tuán': 'tuan2', 'tuǎn': 'tuan3', 'tuàn': 'tuan4', 'tui': 'tui5', 'tuī': 'tui1', 'tuí': 'tui2',
'tuǐ': 'tui3', 'tuì': 'tui4', 'tun': 'tun5', 'tūn': 'tun1', 'tún': 'tun2', 'tǔn': 'tun3', 'tùn': 'tun4',
'tuo': 'tuo5', 'tuō': 'tuo1', 'tuó': 'tuo2', 'tuǒ': 'tuo3', 'tuò': 'tuo4', 'wa': 'wa5', '': 'wa1', '': 'wa2',
'': 'wa3', '': 'wa4', 'wai': 'wai5', 'wāi': 'wai1', 'wái': 'wai2', 'wǎi': 'wai3', 'wài': 'wai4', 'wan': 'wan5',
'wān': 'wan1', 'wán': 'wan2', 'wǎn': 'wan3', 'wàn': 'wan4', 'wang': 'wang5', 'wāng': 'wang1', 'wáng': 'wang2',
'wǎng': 'wang3', 'wàng': 'wang4', 'wei': 'wei5', 'wēi': 'wei1', 'wéi': 'wei2', 'wěi': 'wei3', 'wèi': 'wei4',
'wen': 'wen5', 'wēn': 'wen1', 'wén': 'wen2', 'wěn': 'wen3', 'wèn': 'wen4', 'weng': 'weng5', 'wēng': 'weng1',
'wéng': 'weng2', 'wěng': 'weng3', 'wèng': 'weng4', 'wo': 'wo5', '': 'wo1', '': 'wo2', '': 'wo3', '': 'wo4',
'wu': 'wu5', '': 'wu1', '': 'wu2', '': 'wu3', '': 'wu4', 'xi': 'xi5', '': 'xi1', '': 'xi2', '': 'xi3',
'': 'xi4', 'xia': 'xia5', 'xiā': 'xia1', 'xiá': 'xia2', 'xiǎ': 'xia3', 'xià': 'xia4', 'xian': 'xian5',
'xiān': 'xian1', 'xián': 'xian2', 'xiǎn': 'xian3', 'xiàn': 'xian4', 'xiang': 'xiang5', 'xiāng': 'xiang1',
'xiáng': 'xiang2', 'xiǎng': 'xiang3', 'xiàng': 'xiang4', 'xiao': 'xiao5', 'xiāo': 'xiao1', 'xiáo': 'xiao2',
'xiǎo': 'xiao3', 'xiào': 'xiao4', 'xie': 'xie5', 'xiē': 'xie1', 'xié': 'xie2', 'xiě': 'xie3', 'xiè': 'xie4',
'xin': 'xin5', 'xīn': 'xin1', 'xín': 'xin2', 'xǐn': 'xin3', 'xìn': 'xin4', 'xing': 'xing5', 'xīng': 'xing1',
'xíng': 'xing2', 'xǐng': 'xing3', 'xìng': 'xing4', 'xiong': 'xiong5', 'xiōng': 'xiong1', 'xióng': 'xiong2',
'xiǒng': 'xiong3', 'xiòng': 'xiong4', 'xiu': 'xiu5', 'xiū': 'xiu1', 'xiú': 'xiu2', 'xiǔ': 'xiu3', 'xiù': 'xiu4',
'xu': 'xu5', '': 'xu1', '': 'xu2', '': 'xu3', '': 'xu4', 'xuan': 'xuan5', 'xuān': 'xuan1', 'xuán': 'xuan2',
'xuǎn': 'xuan3', 'xuàn': 'xuan4', 'xue': 'xue5', 'xuē': 'xue1', 'xué': 'xue2', 'xuě': 'xue3', 'xuè': 'xue4',
'xun': 'xun5', 'xūn': 'xun1', 'xún': 'xun2', 'xǔn': 'xun3', 'xùn': 'xun4', 'ya': 'ya5', '': 'ya1', '': 'ya2',
'': 'ya3', '': 'ya4', 'yan': 'yan5', 'yān': 'yan1', 'yán': 'yan2', 'yǎn': 'yan3', 'yàn': 'yan4',
'yang': 'yang5', 'yāng': 'yang1', 'yáng': 'yang2', 'yǎng': 'yang3', 'yàng': 'yang4', 'yao': 'yao5', 'yāo': 'yao1',
'yáo': 'yao2', 'yǎo': 'yao3', 'yào': 'yao4', 'ye': 'ye5', '': 'ye1', '': 'ye2', '': 'ye3', '': 'ye4',
'yi': 'yi5', '': 'yi1', '': 'yi2', '': 'yi3', '': 'yi4', 'yin': 'yin5', 'yīn': 'yin1', 'yín': 'yin2',
'yǐn': 'yin3', 'yìn': 'yin4', 'ying': 'ying5', 'yīng': 'ying1', 'yíng': 'ying2', 'yǐng': 'ying3', 'yìng': 'ying4',
'yo': 'yo5', '': 'yo1', '': 'yo2', '': 'yo3', '': 'yo4', 'yong': 'yong5', 'yōng': 'yong1', 'yóng': 'yong2',
'yǒng': 'yong3', 'yòng': 'yong4', 'you': 'you5', 'yōu': 'you1', 'yóu': 'you2', 'yǒu': 'you3', 'yòu': 'you4',
'yu': 'yu5', '': 'yu1', '': 'yu2', '': 'yu3', '': 'yu4', 'yuan': 'yuan5', 'yuān': 'yuan1', 'yuán': 'yuan2',
'yuǎn': 'yuan3', 'yuàn': 'yuan4', 'yue': 'yue5', 'yuē': 'yue1', 'yué': 'yue2', 'yuě': 'yue3', 'yuè': 'yue4',
'yun': 'yun5', 'yūn': 'yun1', 'yún': 'yun2', 'yǔn': 'yun3', 'yùn': 'yun4', 'za': 'za5', '': 'za1', '': 'za2',
'': 'za3', '': 'za4', 'zai': 'zai5', 'zāi': 'zai1', 'zái': 'zai2', 'zǎi': 'zai3', 'zài': 'zai4', 'zan': 'zan5',
'zān': 'zan1', 'zán': 'zan2', 'zǎn': 'zan3', 'zàn': 'zan4', 'zang': 'zang5', 'zāng': 'zang1', 'záng': 'zang2',
'zǎng': 'zang3', 'zàng': 'zang4', 'zao': 'zao5', 'zāo': 'zao1', 'záo': 'zao2', 'zǎo': 'zao3', 'zào': 'zao4',
'ze': 'ze5', '': 'ze1', '': 'ze2', '': 'ze3', '': 'ze4', 'zei': 'zei5', 'zēi': 'zei1', 'zéi': 'zei2',
'zěi': 'zei3', 'zèi': 'zei4', 'zen': 'zen5', 'zēn': 'zen1', 'zén': 'zen2', 'zěn': 'zen3', 'zèn': 'zen4',
'zeng': 'zeng5', 'zēng': 'zeng1', 'zéng': 'zeng2', 'zěng': 'zeng3', 'zèng': 'zeng4', 'zha': 'zha5', 'zhā': 'zha1',
'zhá': 'zha2', 'zhǎ': 'zha3', 'zhà': 'zha4', 'zhai': 'zhai5', 'zhāi': 'zhai1', 'zhái': 'zhai2', 'zhǎi': 'zhai3',
'zhài': 'zhai4', 'zhan': 'zhan5', 'zhān': 'zhan1', 'zhán': 'zhan2', 'zhǎn': 'zhan3', 'zhàn': 'zhan4',
'zhang': 'zhang5', 'zhāng': 'zhang1', 'zháng': 'zhang2', 'zhǎng': 'zhang3', 'zhàng': 'zhang4', 'zhao': 'zhao5',
'zhāo': 'zhao1', 'zháo': 'zhao2', 'zhǎo': 'zhao3', 'zhào': 'zhao4', 'zhe': 'zhe5', 'zhē': 'zhe1', 'zhé': 'zhe2',
'zhě': 'zhe3', 'zhè': 'zhe4', 'zhen': 'zhen5', 'zhēn': 'zhen1', 'zhén': 'zhen2', 'zhěn': 'zhen3', 'zhèn': 'zhen4',
'zheng': 'zheng5', 'zhēng': 'zheng1', 'zhéng': 'zheng2', 'zhěng': 'zheng3', 'zhèng': 'zheng4', 'zhi': 'zhi5',
'zhī': 'zhi1', 'zhí': 'zhi2', 'zhǐ': 'zhi3', 'zhì': 'zhi4', 'zhong': 'zhong5', 'zhōng': 'zhong1', 'zhóng': 'zhong2',
'zhǒng': 'zhong3', 'zhòng': 'zhong4', 'zhou': 'zhou5', 'zhōu': 'zhou1', 'zhóu': 'zhou2', 'zhǒu': 'zhou3',
'zhòu': 'zhou4', 'zhu': 'zhu5', 'zhū': 'zhu1', 'zhú': 'zhu2', 'zhǔ': 'zhu3', 'zhù': 'zhu4', 'zhua': 'zhua5',
'zhuā': 'zhua1', 'zhuá': 'zhua2', 'zhuǎ': 'zhua3', 'zhuà': 'zhua4', 'zhuai': 'zhuai5', 'zhuāi': 'zhuai1',
'zhuái': 'zhuai2', 'zhuǎi': 'zhuai3', 'zhuài': 'zhuai4', 'zhuan': 'zhuan5', 'zhuān': 'zhuan1', 'zhuán': 'zhuan2',
'zhuǎn': 'zhuan3', 'zhuàn': 'zhuan4', 'zhuang': 'zhuang5', 'zhuāng': 'zhuang1', 'zhuáng': 'zhuang2',
'zhuǎng': 'zhuang3', 'zhuàng': 'zhuang4', 'zhui': 'zhui5', 'zhuī': 'zhui1', 'zhuí': 'zhui2', 'zhuǐ': 'zhui3',
'zhuì': 'zhui4', 'zhun': 'zhun5', 'zhūn': 'zhun1', 'zhún': 'zhun2', 'zhǔn': 'zhun3', 'zhùn': 'zhun4',
'zhuo': 'zhuo5', 'zhuō': 'zhuo1', 'zhuó': 'zhuo2', 'zhuǒ': 'zhuo3', 'zhuò': 'zhuo4', 'zi': 'zi5', '': 'zi1',
'': 'zi2', '': 'zi3', '': 'zi4', 'zong': 'zong5', 'zōng': 'zong1', 'zóng': 'zong2', 'zǒng': 'zong3',
'zòng': 'zong4', 'zou': 'zou5', 'zōu': 'zou1', 'zóu': 'zou2', 'zǒu': 'zou3', 'zòu': 'zou4', 'zu': 'zu5',
'': 'zu1', '': 'zu2', '': 'zu3', '': 'zu4', 'zuan': 'zuan5', 'zuān': 'zuan1', 'zuán': 'zuan2',
'zuǎn': 'zuan3', 'zuàn': 'zuan4', 'zui': 'zui5', 'zuī': 'zui1', 'zuí': 'zui2', 'zuǐ': 'zui3', 'zuì': 'zui4',
'zun': 'zun5', 'zūn': 'zun1', 'zún': 'zun2', 'zǔn': 'zun3', 'zùn': 'zun4', 'zuo': 'zuo5', 'zuō': 'zuo1',
'zuó': 'zuo2', 'zuǒ': 'zuo3', 'zuò': 'zuo4', 'zhei': 'zhei5', 'zhēi': 'zhei1', 'zhéi': 'zhei2', 'zhěi': 'zhei3',
'zhèi': 'zhei4', 'kei': 'kei5', 'kēi': 'kei1', 'kéi': 'kei2', 'kěi': 'kei3', 'kèi': 'kei4', 'tei': 'tei5',
'tēi': 'tei1', 'téi': 'tei2', 'těi': 'tei3', 'tèi': 'tei4', 'len': 'len5', 'lēn': 'len1', 'lén': 'len2',
'lěn': 'len3', 'lèn': 'len4', 'nun': 'nun5', 'nūn': 'nun1', 'nún': 'nun2', 'nǔn': 'nun3', 'nùn': 'nun4',
'nia': 'nia5', 'niā': 'nia1', 'niá': 'nia2', 'niǎ': 'nia3', 'nià': 'nia4', 'rua': 'rua5', 'ruā': 'rua1',
'ruá': 'rua2', 'ruǎ': 'rua3', 'ruà': 'rua4', 'fiao': 'fiao5', 'fiāo': 'fiao1', 'fiáo': 'fiao2', 'fiǎo': 'fiao3',
'fiào': 'fiao4', 'cei': 'cei5', 'cēi': 'cei1', 'céi': 'cei2', 'cěi': 'cei3', 'cèi': 'cei4', 'wong': 'wong5',
'wōng': 'wong1', 'wóng': 'wong2', 'wǒng': 'wong3', 'wòng': 'wong4', 'din': 'din5', 'dīn': 'din1', 'dín': 'din2',
'dǐn': 'din3', 'dìn': 'din4', 'chua': 'chua5', 'chuā': 'chua1', 'chuá': 'chua2', 'chuǎ': 'chua3', 'chuà': 'chua4',
'n': 'n5', 'n1': 'n1', 'ń': 'n2', 'ň': 'n3', 'ǹ': 'n4', 'ng': 'ng5', 'ng1': 'ng1', 'ńg': 'ng2', 'ňg': 'ng3',
'ǹg': 'ng4'}
shengyundiao2guobiao_dict = {v: k for k, v in guobiao2shengyundiao_dict.items()}
def guobiao2shengyundiao(pinyin_list):
"""国标样式拼音转为声母韵母音调样式的拼音。"""
out = []
for pin in pinyin_list:
out.append(guobiao2shengyundiao_dict.get(pin))
return out
def shengyundiao2guobiao(pinyin_list):
"""声母韵母音调样式的拼音转为国标样式的拼音。"""
out = []
for pin in pinyin_list:
out.append(shengyundiao2guobiao_dict.get(pin))
return out
if __name__ == "__main__":
logger.info(__file__)
out = shengyundiao2guobiao('ni2 hao3 a5'.split())
assert out == ['', 'hǎo', 'a']
out = guobiao2shengyundiao(out)
assert out == ['ni2', 'hao3', 'a5']
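    # Note that both converters use dict.get, so an out-of-table syllable maps to None
    # instead of raising; callers are expected to pass syllables from the 420-syllable table.
    assert guobiao2shengyundiao(['hǎo', 'xyz']) == ['hao3', None]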

@ -0,0 +1,78 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### symbol
Phoneme symbol inventories:
the Chinese phoneme set, a simple English symbol set and a simple Chinese symbol set.
"""
_pad = '_'  # padding symbol
_eos = '~'  # end-of-sequence symbol
_chain = '-'  # chain symbol linking pronunciation units
_oov = '*'
# Chinese phoneme inventory
# Initials (27)
_shengmu = [
'aa', 'b', 'c', 'ch', 'd', 'ee', 'f', 'g', 'h', 'ii', 'j', 'k', 'l', 'm', 'n', 'oo', 'p', 'q', 'r', 's', 'sh',
't', 'uu', 'vv', 'x', 'z', 'zh'
]
# Finals (41)
_yunmu = [
'a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing',
'iong', 'iu', 'ix', 'iy', 'iz', 'o', 'ong', 'ou', 'u', 'ua', 'uai', 'uan', 'uang', 'ueng', 'ui', 'un', 'uo', 'v',
'van', 've', 'vn', 'ng', 'uong'
]
# Tones (5)
_shengdiao = ['1', '2', '3', '4', '5']
# Letter-name symbols (26)
_alphabet = 'Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz'.split()
# English letters (26)
_english = 'A B C D E F G H I J K L M N O P Q R S T U V W X Y Z'.split()
# Punctuation (10)
_biaodian = '! ? . , ; : " # ( )'.split()
# Note: !=!！|?=?？|.=.。|,=,，、|;=;；|:=:：|"="“”'‘’|#=#＃ 　\t|(=(（[［{｛【<《|)=)）]］}｝】>》
# Others (7)
_other = 'w y 0 6 7 8 9'.split()
# Uppercase letters (26)
_upper = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
# Lowercase letters (26)
_lower = list('abcdefghijklmnopqrstuvwxyz')
# Punctuation characters (12)
_punctuation = list('!\'"(),-.:;? ')
# Digits (10)
_digit = list('0123456789')
# Letters and punctuation (64)
# Used for English: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s
_character_en = _upper + _lower + _punctuation
# Letters, digits and punctuation (74)
# Used for English or Chinese: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s0123456789
_character_cn = _upper + _lower + _punctuation + _digit
# Chinese phonemes (145)
# Supports Chinese, English and mixed Chinese-English environments; Chinese text is rendered with the Tsinghua-standard phonemes
symbol_chinese = [_pad, _eos, _chain] + _shengmu + _yunmu + _shengdiao + _alphabet + _english + _biaodian + _other
# Simple English phonemes (66)
# Supports English-only environments
# ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s
symbol_english_simple = [_pad, _eos] + _upper + _lower + _punctuation
# Simple Chinese phonemes (76)
# Supports English and Chinese environments; Chinese text is converted to pinyin strings
# ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s0123456789
symbol_chinese_simple = [_pad, _eos] + _upper + _lower + _punctuation + _digit
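if __name__ == "__main__":
    # Sanity check of the advertised inventory sizes: 3 specials + 27 initials + 41 finals
    # + 5 tones + 26 letter names + 26 English letters + 10 punctuation marks + 7 others = 145.
    assert len(symbol_chinese) == 145
    assert len(symbol_english_simple) == 66
    assert len(symbol_chinese_simple) == 76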

@ -0,0 +1,19 @@
Copyright (c) 2017 Keith Ito
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

@ -0,0 +1,116 @@
"""
### english
from https://github.com/keithito/tacotron:
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
"""
import re
import random
from . import cleaners
from .symbols import symbols
# Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
# Regular expression matching text enclosed in curly braces:
_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)')
def get_arpabet(word, dictionary):
word_arpabet = dictionary.lookup(word)
if word_arpabet is not None:
return "{" + word_arpabet[0] + "}"
else:
return word
def text_to_sequence(text, cleaner_names, dictionary=None, p_arpabet=1.0):
'''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
The text can optionally have ARPAbet sequences enclosed in curly braces embedded
in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street."
Args:
text: string to convert to a sequence
cleaner_names: names of the cleaner functions to run the text through
dictionary: arpabet class with arpabet dictionary
Returns:
List of integers corresponding to the symbols in the text
'''
sequence = []
space = _symbols_to_sequence(' ')
# Check for curly braces and treat their contents as ARPAbet:
while len(text):
m = _curly_re.match(text)
if not m:
clean_text = _clean_text(text, cleaner_names)
if dictionary is not None:
clean_text = [get_arpabet(w, dictionary)
if random.random() < p_arpabet else w
for w in clean_text.split(" ")]
for i in range(len(clean_text)):
t = clean_text[i]
if t.startswith("{"):
sequence += _arpabet_to_sequence(t[1:-1])
else:
sequence += _symbols_to_sequence(t)
sequence += space
else:
sequence += _symbols_to_sequence(clean_text)
break
sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))
sequence += _arpabet_to_sequence(m.group(2))
text = m.group(3)
# remove trailing space
    sequence = sequence[:-1] if sequence and sequence[-1] == space[0] else sequence
return sequence
def sequence_to_text(sequence):
'''Converts a sequence of IDs back to a string'''
result = []
for symbol_id in sequence:
if symbol_id in _id_to_symbol:
s = _id_to_symbol[symbol_id]
# Enclose ARPAbet back in curly braces:
if len(s) > 1 and s[0] == '@':
s = '{%s}' % s[1:]
result.append(s)
result = ''.join(result)
return result.replace('}{', ' ')
def _clean_text(text, cleaner_names):
for name in cleaner_names:
cleaner = getattr(cleaners, name)
if not cleaner:
raise Exception('Unknown cleaner: %s' % name)
text = cleaner(text)
return text
def _symbols_to_sequence(symbols):
return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]
def _arpabet_to_sequence(text):
return _symbols_to_sequence(['@' + s for s in text.split()])
def _should_keep_symbol(s):
    return s in _symbol_to_id and s != '_' and s != '~'
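if __name__ == "__main__":
    # Hedged usage sketch: curly-brace spans are read as ARPAbet, everything else is cleaned
    # and mapped symbol by symbol. Assumes the sibling cleaners/symbols/numbers modules and
    # their dependencies (unidecode, inflect) are available.
    seq = text_to_sequence("Turn left on {HH AW1 S S T AH0 N} Street.", ["english_cleaners"])
    print(seq)
    print(sequence_to_text(seq))   # the ARPAbet span comes back wrapped in {...}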

@ -0,0 +1,91 @@
'''
### english
from https://github.com/keithito/tacotron
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
'''
import re
from unidecode import unidecode
from .numbers import normalize_numbers
# Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+')
# List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
    ('mrs', 'misess'),
    ('mr', 'mister'),
    ('dr', 'doctor'),
    ('st', 'saint'),
    ('co', 'company'),
    ('jr', 'junior'),
    ('maj', 'major'),
    ('gen', 'general'),
    ('drs', 'doctors'),
    ('rev', 'reverend'),
    ('lt', 'lieutenant'),
    ('hon', 'honorable'),
    ('sgt', 'sergeant'),
    ('capt', 'captain'),
    ('esq', 'esquire'),
    ('ltd', 'limited'),
    ('col', 'colonel'),
    ('ft', 'fort'),
]]
def expand_abbreviations(text):
    for regex, replacement in _abbreviations:
        text = re.sub(regex, replacement, text)
    return text


def expand_numbers(text):
    return normalize_numbers(text)


def lowercase(text):
    return text.lower()


def collapse_whitespace(text):
    return re.sub(_whitespace_re, ' ', text)


def convert_to_ascii(text):
    return unidecode(text)


def basic_cleaners(text):
    '''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text


def transliteration_cleaners(text):
    '''Pipeline for non-English text that transliterates to ASCII.'''
    text = convert_to_ascii(text)
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text


def english_cleaners(text):
    '''Pipeline for English text, including number and abbreviation expansion.'''
    text = convert_to_ascii(text)
    text = lowercase(text)
    text = expand_numbers(text)
    text = expand_abbreviations(text)
    text = collapse_whitespace(text)
    return text
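A quick sanity check of the English pipeline above (a sketch; the import path is assumed and the output comment reflects what the rules imply, not a captured run):

from phkit.english.cleaners import english_cleaners  # assumed import path

print(english_cleaners("Dr. Smith paid $16 on Dec. 2nd"))
# roughly: "doctor smith paid sixteen dollars on dec. second"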

File diff suppressed because it is too large

@ -0,0 +1,65 @@
""" from https://github.com/keithito/tacotron """
import re
valid_symbols = [
    'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2',
    'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2',
    'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY',
    'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1',
    'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0',
    'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW',
    'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH'
]
_valid_symbol_set = set(valid_symbols)
class CMUDict:
    '''Thin wrapper around CMUDict data. http://www.speech.cs.cmu.edu/cgi-bin/cmudict'''

    def __init__(self, file_or_path, keep_ambiguous=True):
        if isinstance(file_or_path, str):
            with open(file_or_path, encoding='latin-1') as f:
                entries = _parse_cmudict(f)
        else:
            entries = _parse_cmudict(file_or_path)
        if not keep_ambiguous:
            entries = {word: pron for word, pron in entries.items() if len(pron) == 1}
        self._entries = entries

    def __len__(self):
        return len(self._entries)

    def lookup(self, word):
        '''Returns list of ARPAbet pronunciations of the given word.'''
        return self._entries.get(word.upper())
_alt_re = re.compile(r'\([0-9]+\)')
def _parse_cmudict(file):
    cmudict = {}
    for line in file:
        if len(line) and (line[0] >= 'A' and line[0] <= 'Z' or line[0] == "'"):
            parts = line.split('  ')  # CMUdict separates word and pronunciation with two spaces
            word = re.sub(_alt_re, '', parts[0])
            pronunciation = _get_pronunciation(parts[1])
            if pronunciation:
                if word in cmudict:
                    cmudict[word].append(pronunciation)
                else:
                    cmudict[word] = [pronunciation]
    return cmudict


def _get_pronunciation(s):
    parts = s.strip().split(' ')
    for part in parts:
        if part not in _valid_symbol_set:
            return None
    return ' '.join(parts)
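A minimal lookup sketch for the wrapper above (the dictionary path mirrors the example script later in this diff and is illustrative):

from phkit.english.cmudict import CMUDict

d = CMUDict('phkit/english/cmu_dictionary', keep_ambiguous=True)
print(len(d))              # number of entries
print(d.lookup('speech'))  # e.g. ['S P IY1 CH'] for a standard CMUdict file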

@ -0,0 +1,71 @@
""" from https://github.com/keithito/tacotron """
import inflect
import re
_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
_number_re = re.compile(r'[0-9]+')
def _remove_commas(m):
    return m.group(1).replace(',', '')


def _expand_decimal_point(m):
    return m.group(1).replace('.', ' point ')


def _expand_dollars(m):
    match = m.group(1)
    parts = match.split('.')
    if len(parts) > 2:
        return match + ' dollars'  # Unexpected format
    dollars = int(parts[0]) if parts[0] else 0
    cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
    if dollars and cents:
        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
        cent_unit = 'cent' if cents == 1 else 'cents'
        return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
    elif dollars:
        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
        return '%s %s' % (dollars, dollar_unit)
    elif cents:
        cent_unit = 'cent' if cents == 1 else 'cents'
        return '%s %s' % (cents, cent_unit)
    else:
        return 'zero dollars'


def _expand_ordinal(m):
    return _inflect.number_to_words(m.group(0))


def _expand_number(m):
    num = int(m.group(0))
    if num > 1000 and num < 3000:
        if num == 2000:
            return 'two thousand'
        elif num > 2000 and num < 2010:
            return 'two thousand ' + _inflect.number_to_words(num % 100)
        elif num % 100 == 0:
            return _inflect.number_to_words(num // 100) + ' hundred'
        else:
            return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
    else:
        return _inflect.number_to_words(num, andword='')


def normalize_numbers(text):
    text = re.sub(_comma_number_re, _remove_commas, text)
    text = re.sub(_pounds_re, r'\1 pounds', text)
    text = re.sub(_dollars_re, _expand_dollars, text)
    text = re.sub(_decimal_number_re, _expand_decimal_point, text)
    text = re.sub(_ordinal_re, _expand_ordinal, text)
    text = re.sub(_number_re, _expand_number, text)
    return text
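A short sketch of what normalize_numbers does end to end (the import path is assumed; the output comment is what the rules above imply, not a captured run):

from phkit.english.numbers import normalize_numbers  # assumed import path

print(normalize_numbers("In 1984 she paid $3.50 for 2 books."))
# roughly: "In nineteen eighty-four she paid 3 dollars, 50 cents for two books."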

@ -0,0 +1,21 @@
""" from https://github.com/keithito/tacotron """
'''
Defines the set of symbols used in text input to the model.

The default is a set of ASCII characters that works well for English or text that has been
run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details.
'''
from . import cmudict
_punctuation = '!\'",.:;? '
_math = '#%&*+-/[]()'
_special = '_@©°½—₩€$'
_accented = 'áçéêëñöøćž'
_numbers = '0123456789'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as
# uppercase letters):
_arpabet = ['@' + s for s in cmudict.valid_symbols]
# Export all symbols:
symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet
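The ID of a symbol is simply its index in this list; a minimal sketch of the lookup tables the text module above builds from it (the import path is an assumption):

from phkit.english.symbols import symbols  # assumed import path

symbol_to_id = {s: i for i, s in enumerate(symbols)}
print(symbol_to_id['a'], symbol_to_id['@AA1'])  # plain letter vs. prefixed ARPAbet symbol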

@ -0,0 +1,50 @@
"""
### pinyinkit
Text-to-pinyin module; depends on the python-pinyin, jieba and phrase-pinyin-data packages.
"""
import re
from pypinyin import lazy_pinyin, Style
# Kept for compatibility with versions before 0.1.0.
# Tone 5 marks the neutral tone.
_diao_re = re.compile(r"([12345]$)")
def text2pinyin(text, errors=None, **kwargs):
    """
    Convert Chinese text to a list of pinyin syllables.
    :param text: str, Chinese text string.
    :param errors: function, handler for characters that cannot be converted; by default they are kept as-is.
    :return: list, list of pinyin syllables.
    """
    if errors is None:
        errors = default_errors
    pin = lazy_pinyin(text, style=Style.TONE3, errors=errors, strict=True,
                      neutral_tone_with_five=True, **kwargs)
    return pin


def default_errors(x):
    return list(x)
def split_pinyin(py):
    """
    Split a single pinyin syllable into a phoneme list (the toneless syllable plus the tone digit).
    :param py: str, pinyin string.
    :return: list, phoneme list.
    """
    parts = _diao_re.split(py)
    if len(parts) == 1:
        fuyuan = py
        diao = "5"
    else:
        fuyuan = parts[0]
        diao = parts[1]
    return [fuyuan, diao]
if __name__ == "__main__":
    print(__file__)
    assert text2pinyin("拼音") == ['pin1', 'yin1']
    assert text2pinyin("汉字,a1") == ['han4', 'zi4', ',', 'a', '1']

@ -0,0 +1,4 @@
jieba
inflect
unidecode
tqdm

@ -0,0 +1,44 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2019/12/1
"""
local
"""
import logging
logging.basicConfig(level=logging.INFO)
def run_text2phoneme():
    from phkit.chinese.sequence import text2phoneme, text2sequence
    text = "汉字转音素TTS《Text to speech》。"
    # text = "岂有此理"
    # text = "我的儿子玩会儿"
    out = text2phoneme(text)
    print(out)
    # ['h', 'an', '4', '-', 'z', 'iy', '4', '-', 'zh', 'uan', '3', '-', 'ii', 'in', '1', '-', 's', 'u', '4', '-', ',',
    #  'Tt', 'Tt', 'Ss', ':', '(', 'T', 'E', 'X', 'T', '#', 'T', 'O', '#', 'S', 'P', 'E', 'E', 'C', 'H', ')', '.', '-',
    #  '~', '_']
    out = text2sequence(text)
    print(out)
    # [11, 32, 76, 2, 28, 51, 76, 2, 29, 59, 75, 2, 12, 46, 73, 2, 22, 56, 76, 2, 133, 97, 97, 96, 135, 138, 123, 108,
    #  127, 123, 137, 123, 118, 137, 122, 119, 108, 108, 106, 111, 139, 132, 2, 1, 0]
def run_english():
    from phkit.english import text_to_sequence, sequence_to_text
    from phkit.english.cmudict import CMUDict
    text = "text to speech"
    cmupath = 'phkit/english/cmu_dictionary'
    cmudict = CMUDict(cmupath)
    seq = text_to_sequence(text, cleaner_names=["english_cleaners"], dictionary=cmudict)
    print(seq)
    txt = sequence_to_text(seq)
    print(txt)


if __name__ == "__main__":
    print(__file__)
    run_text2phoneme()
    run_english()

@ -0,0 +1,86 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2019/12/15
"""
Speech processing toolbox.
Build a wheel package: python setup.py bdist_wheel
Upload to PyPI directly: python setup.py sdist upload
Upload to PyPI with twine:
    build the source package: python setup.py sdist
    upload the package: twine upload dist/phkit-0.0.3.tar.gz
Note: a .pypirc config file must exist in the home directory, with content in this format:
[distutils]
index-servers=pypi
[pypi]
repository = https://upload.pypi.org/legacy/
username: admin
password: admin
"""
from setuptools import setup, find_packages
import os
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(os.path.splitext(os.path.basename(__name__))[0])
install_requires = ['jieba>=0.42.1', 'tqdm', 'inflect', 'unidecode']
requires = install_requires
def create_readme():
    from phkit import readme_docs
    docs = []
    with open("README.md", "wt", encoding="utf8") as fout:
        for doc in readme_docs:
            fout.write(doc)
            docs.append(doc)
    return "".join(docs)


def pip_install():
    for pkg in install_requires + requires:
        try:
            os.system("pip install {}".format(pkg))
        except Exception as e:
            logger.info("pip install {} failed".format(pkg))


pip_install()
phkit_doc = create_readme()
from phkit import __version__ as phkit_version
setup(
    name="phkit",
    version=phkit_version,
    author="kuangdd",
    author_email="kuangdd@foxmail.com",
    description="phoneme toolkit",
    long_description=phkit_doc,
    long_description_content_type="text/markdown",
    url="https://github.com/KuangDD/phkit",
    packages=find_packages(exclude=['contrib', 'docs', 'tests*']),
    install_requires=install_requires,  # minimum dependencies needed to run the project
    python_requires='>=3.5',  # required Python version
    package_data={
        'txt': ['requirements.txt'],
        'md': ['**/*.md', '*.md'],
    },  # package data, typically data closely tied to the package implementation
    classifiers=[
        'Intended Audience :: Developers',
        'Topic :: Software Development :: Build Tools',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        "Operating System :: OS Independent",
    ],
)
if __name__ == "__main__":
    print(__file__)

@ -0,0 +1,61 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/18
"""
"""
def test_phkit():
    from phkit import text2phoneme, text2sequence, symbol_chinese
    from phkit import chinese_sequence_to_text, chinese_text_to_sequence

    text = "汉字转音素TTS《Text to speech》。"
    target_ph = ['h', 'an', '4', '-', 'z', 'iy', '4', '-', 'zh', 'uan', '3', '-', 'ii', 'in', '1', '-', 's', 'u', '4',
                 '-', ',', '-',
                 'Tt', 'Tt', 'Ss', '-', ':', '-', '(', '-', 'T', 'E', 'X', 'T', '-', '#', '-', 'T', 'O', '-', '#', '-',
                 'S', 'P', 'E', 'E', 'C', 'H', '-', ')', '-', '.', '-', '~', '_']
    result = text2phoneme(text)
    assert result == target_ph

    target_seq = [11, 32, 74, 2, 28, 51, 74, 2, 29, 59, 73, 2, 12, 46, 71, 2, 22, 56, 74, 2, 131, 2, 95, 95, 94, 2, 133,
                  2, 136, 2, 121,
                  106, 125, 121, 2, 135, 2, 121, 116, 2, 135, 2, 120, 117, 106, 106, 104, 109, 2, 137, 2, 130, 2, 1, 0]
    result = text2sequence(text)
    assert result == target_seq

    result = chinese_text_to_sequence(text)
    assert result == target_seq

    target_ph = ' '.join(target_ph)
    result = chinese_sequence_to_text(result)
    assert result == target_ph

    assert len(symbol_chinese) == 145

    text = "岂有此理"
    target = ['q', 'i', '2', '-', 'ii', 'iu', '3', '-', 'c', 'iy', '2', '-', 'l', 'i', '3', '-', '~', '_']
    result = text2phoneme(text)
    assert result == target

    text = "我的儿子玩会儿"
    target = ['uu', 'uo', '3', '-', 'd', 'e', '5', '-', 'ee', 'er', '2', '-', 'z', 'iy', '5', '-', 'uu', 'uan', '2',
              '-', 'h', 'ui', '4', '-', 'ee', 'er', '5', '-', '~', '_']
    result = text2phoneme(text)
    assert result == target
def test_convert():
    from phkit import ban2quan, quan2ban, jian2fan, fan2jian
    assert ban2quan("aA1 ,:$。、") == "aA1 ,:$。、"
    assert quan2ban("aA1 ,:$。、") == "aA1 ,:$。、"
    assert jian2fan("中国语言") == "中國語言"
    assert fan2jian("中國語言") == "中国语言"
    print(fan2jian("中國語言"))
    print(jian2fan("中国语言"))


if __name__ == "__main__":
    print(__file__)
    test_phkit()
    test_convert()

@ -16,6 +16,7 @@ from pypinyin.converter import DefaultConverter
from pypinyin.seg import mmseg
from pypinyin.seg import simpleseg
from pypinyin.utils import (_replace_tone2_style_dict_to_default)
import jieba
TStyle = Style
TErrors = Union[Callable[[Text], Text], Text]
@ -139,7 +140,8 @@ class Pinyin():
        :param hans: the string before word segmentation
        :return: ``None`` or ``list``
        """
        pass
        outs = jieba.lcut(hans)  # use jieba word segmentation by default (splits on semantic word boundaries)
        return outs

    def post_seg(self, hans: Text, seg_data: List[Text],
                 **kwargs: Any) -> Optional[List[Text]]:
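For context, jieba.lcut simply returns a list of word tokens; a minimal sketch of what the segmentation hook patched above feeds into pypinyin (assuming jieba is installed):

import jieba

print(jieba.lcut("汉字转拼音"))  # e.g. ['汉字', '转', '拼音']; the exact split depends on jieba's dictionary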

@ -10,3 +10,4 @@ Sphinx
tox
twine
wheel>=0.21
jieba

@ -17,7 +17,7 @@ packages = [
'pypinyin.style',
]
requirements = []
requirements = ["jieba"]
if sys.version_info[:2] < (3, 4):
requirements.append('enum34')
if sys.version_info[:2] < (3, 5):

@ -5,12 +5,6 @@ script: tox
matrix:
include:
- python: 2.7
env: TOXENV=py27
- python: 3.4
env: TOXENV=py34
- python: 3.5
env: TOXENV=py35
- python: 3.6
env: TOXENV=py36
- python: 3.6

@ -1,14 +0,0 @@
=======
Credits
=======
Author and Maintainer
---------------------
* Thomas Roten <https://github.com/tsroten>
Contributors
------------
None yet. Why not be the first?

@ -1,88 +0,0 @@
Changes
=======
v0.1.0 (2013-05-05)
-------------------
* Initial release
v0.1.1 (2013-05-05)
-------------------
* Adds zhon.cedict package to setup.py
v0.2.0 (2013-05-07)
-------------------
* Allows for mapping between simplified and traditional.
* Adds logging to build_string().
* Adds constants for numbered Pinyin and accented Pinyin.
v0.2.1 (2013-05-07)
-------------------
* Fixes typo in README.rst.
v.1.0.0 (2014-01-25)
--------------------
* Complete rewrite that refactors code, renames constants, and improves Pinyin
support.
v.1.1.0 (2014-01-28)
--------------------
* Adds ``zhon.pinyin.punctuation`` constant.
* Adds ``zhon.pinyin.accented_syllable``, ``zhon.pinyin.accented_word``, and
``zhon.pinyin.accented_sentence`` constants.
* Adds ``zhon.pinyin.numbered_syllable``, ``zhon.pinyin.numbered_word``, and
``zhon.pinyin.numbered_sentence`` constants.
* Fixes some README.rst typos.
* Clarifies information regarding Traditional and Simplified character
constants in README.rst.
* Adds constant short names to README.rst.
v.1.1.1 (2014-01-29)
--------------------
* Adds documentation.
* Adds ``zhon.cedict.all`` constant.
* Removes duplicate code ranges from ``zhon.hanzi.characters``.
* Makes ``zhon.hanzi.non_stops`` a string containing all non-stops instead of
a string containing code ranges.
* Removes duplicate letters in ``zhon.pinyin.consonants``.
* Refactors Pinyin vowels/consonant code.
* Removes the Latin alpha from ``zhon.pinyin.vowels``. Fixes #16.
* Adds ``cjk_ideographs`` alias for ``zhon.hanzi.characters``.
* Fixes various typos.
* Removes numbers from Pinyin word constants. Fixes #15.
* Adds lowercase and uppercase constants to ``zhon.pinyin``.
* Fixes a bug with ``zhon.pinyin.sentence``.
* Adds ``sent`` alias for ``zhon.pinyin.sentence``.
v.1.1.2 (2014-01-31)
--------------------
* Fixes bug with ``zhon.cedict.all``.
v.1.1.3 (2014-02-12)
--------------------
* Adds Ideographic number zero to ``zhon.hanzi.characters``. Fixes #17.
* Fixes r-suffix bug. Fixes #18.
v.1.1.4 (2015-01-25)
--------------------
* Removes duplicate module declarations in documentation.
* Moves tests inside zhon package.
* Adds travis config file.
* Adds Python 3.4 tests to travis and tox.
* Fixes flake8 warnings.
* Adds distutil fallback import statement to setup.py.
* Adds missing hanzi punctuation. Fixes #19.
v.1.1.5 (2016-05-23)
--------------------
* Add missing Zhuyin characters. Fixes #23.

@ -1,107 +0,0 @@
============
Contributing
============
Contributions are welcome, and they are greatly appreciated! Every
little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions
----------------------
Report Bugs
~~~~~~~~~~~
Report bugs at https://github.com/tsroten/zhon/issues.
If you are reporting a bug, please include:
* Your operating system name and version.
* Any details about your local setup that might be helpful in troubleshooting.
* Detailed steps to reproduce the bug.
Fix Bugs
~~~~~~~~
Look through the GitHub issues for bugs. Anything tagged with "bug"
is open to whoever wants to implement it.
Implement Features
~~~~~~~~~~~~~~~~~~
Look through the GitHub issues for features. Anything tagged with "feature"
is open to whoever wants to implement it.
Write Documentation
~~~~~~~~~~~~~~~~~~~
Zhon could always use more documentation, whether as part of the
official Zhon docs, in docstrings, or even on the web in blog posts,
articles, and such.
Submit Feedback
~~~~~~~~~~~~~~~
The best way to send feedback is to file an issue at https://github.com/tsroten/zhon/issues.
If you are proposing a feature:
* Explain in detail how it would work.
* Keep the scope as narrow as possible, to make it easier to implement.
* Remember that this is a volunteer-driven project, and that contributions
are welcome :)
Get Started!
------------
Ready to contribute? Here's how to set up `zhon` for local development.
1. Fork the `zhon` repo on GitHub.
2. Clone your fork locally::
$ git clone git@github.com:your_name_here/zhon.git
3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development::
$ mkvirtualenv zhon
$ cd zhon/
$ python setup.py develop
4. Create a branch for local development::
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
5. When you're done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox::
$ flake8 zhon
$ python setup.py test
$ tox
To get flake8 and tox, just pip install them into your virtualenv.
You can ignore the flake8 errors regarding `zhon.cedict` files. Rather than include hundreds of newline characters in each file, we are ignoring those errors.
6. Commit your changes and push your branch to GitHub::
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
7. Submit a pull request through the GitHub website.
Pull Request Guidelines
-----------------------
Before you submit a pull request, check that it meets these guidelines:
1. The pull request should include tests.
2. If the pull request adds functionality, the docs should be updated. Put
your new functionality into a function with a docstring, and add the
feature to the list in README.rst.
3. The pull request should work for Python 2.7, 3.3, and 3.4. Check
https://travis-ci.org/tsroten/zhon/pull_requests
and make sure that the tests pass for all supported Python versions.
4. If you want to receive credit, add your name to `AUTHORS.rst`.

Some files were not shown because too many files have changed in this diff