chinese to phoneme (#633)

pull/636/head
Hui Zhang 3 years ago committed by GitHub
parent 1b373bfcd7
commit fd5213105b

@ -18,3 +18,7 @@ licence: MIT
* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
licence: MIT
* [phkit](https://github.com/KuangDD/phkit.git)
commit: b2100293c1e36da531d7f30bd52c9b955a649522
licence: None

@ -0,0 +1,155 @@
![phkit](phkit.png "phkit")
## phkit
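phoneme toolkit: a toolbox for pinyin-related text processing, providing front-end text solutions for Chinese and English speech synthesis.
#### Installation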
```
pip install -U phkit
```
#### Version
v0.2.8
### pinyinkit
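The text-to-pinyin module. It depends on python-pinyin, jieba, and phrase-pinyin-data.
### chinese
Phoneme set for Chinese, English, and mixed Chinese-English text. Chinese pinyin uses the Tsinghua University phoneme scheme; English characters are split into letter phonemes and English phonemes.
- Overview of the Chinese phoneme set: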
```
Initials:
aa b c ch d ee f g h ii j k l m n oo p q r s sh t uu vv x z zh
Finals:
a ai an ang ao e ei en eng er i ia ian iang iao ie in ing iong iu ix iy iz o ong ou u ua uai uan uang ueng ui un uo v van ve vn ng uong
Tones:
1 2 3 4 5
Letters:
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz
English:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Punctuation:
! ? . , ; : " # ( )
Note: !=!|?=?|.=.。|,=,,、|;=;|:=:|"="“|#=#   |(=([{{【<《|)=)]}}】>》
Reserved:
w y 0 6 7 8 9
w=%|y=$|0=0|6=6|7=7|8=8|9=9
Other:
_ ~ - *
```
#### symbol
Phoneme symbols.
Chinese phonemes, simple English phonemes, and simple Chinese phonemes.
#### sequence
Methods for converting to sequences: text to a phoneme list, and text to an ID list.
Pinyin tone sandhi, and pinyin to phonemes.
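The tone sandhi step can be illustrated with a minimal sketch. It assumes the rule implemented by `change_diao` in this commit (a third tone directly before another third tone becomes a second tone, scanning from right to left), but it works on whole pinyin syllables rather than on the phoneme token list. The name `third_tone_sandhi` is hypothetical and not part of phkit.

```python
def third_tone_sandhi(syllables):
    """Return a copy in which a tone 3 directly before another tone 3 becomes tone 2."""
    out = list(syllables)
    next_is_third = False  # does the syllable to the right still carry tone 3?
    for i in range(len(out) - 1, -1, -1):
        is_third = out[i].endswith("3")
        if is_third and next_is_third:
            out[i] = out[i][:-1] + "2"
            next_is_third = False  # a changed syllable no longer triggers sandhi
        else:
            next_is_third = is_third
    return out


print(third_tone_sandhi(["ni3", "hao3"]))  # ['ni2', 'hao3']
```

Scanning from the right means that in a run of three third tones only the middle one changes, which matches the behaviour of the right-to-left loop in `change_diao`.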
#### pinyin
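Methods for converting to pinyin: Chinese characters to pinyin, and tone separation.
Pinyin is written as letters plus a tone digit, for example pin1.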
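As a rough illustration of tone separation, the sketch below splits a letters-plus-digit syllable into its toneless part and its tone digit. The function name `split_tone` and the fallback to tone 5 for syllables without a digit are assumptions made for this example; the actual `split_pinyin` in phkit may behave differently.

```python
def split_tone(py, neutral="5"):
    """Split 'pin1' into ('pin', '1'); syllables without a digit get the neutral tone."""
    if py and py[-1].isdigit():
        return py[:-1], py[-1]
    return py, neutral  # assumption: a missing digit means neutral tone


print(split_tone("pin1"))  # ('pin', '1')
```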
#### phoneme
Phoneme mapping tables.
Toneless pinyin to phonemes, tones to phonemes, English letters to phonemes, and punctuation to phonemes.
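A small sketch of the lookup idea: the toneless syllable is looked up in a table that yields an initial and a final, and the tone digit is appended as its own phoneme. The five table entries are copied from `shengyun2ph_dict` in this commit; the names `SHENGYUN_EXCERPT` and `syllable_to_phonemes` are hypothetical.

```python
# Excerpt of the pinyin-to-phoneme table (values as in shengyun2ph_dict).
SHENGYUN_EXCERPT = {
    "ka": "k a",
    "er": "ee er",
    "pu": "p u",
    "yi": "ii i",
    "zhi": "zh ix",
}


def syllable_to_phonemes(py):
    """Map e.g. 'ka3' to ['k', 'a', '3'] using the excerpt above."""
    base, tone = py[:-1], py[-1]
    if base not in SHENGYUN_EXCERPT or tone not in "12345":
        raise KeyError(py)
    return SHENGYUN_EXCERPT[base].split() + [tone]


print(syllable_to_phonemes("ka3"))  # ['k', 'a', '3']
```

Note that zero-initial syllables such as `er` still receive a placeholder initial (`ee`), so every syllable yields the same initial, final, tone pattern.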
#### number
How numbers are read.
Read by numeric value, or read digit by digit.
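The two reading styles can be contrasted with a short sketch. It is not the library's implementation: `read_digits` and `read_value` are hypothetical names, and `read_value` only handles integers from 0 to 9999, whereas `say_number` in this commit covers up to 16 digits.

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def read_digits(num):
    """Read a digit string one digit at a time."""
    return "".join(DIGITS[int(ch)] for ch in num)


def read_value(n):
    """Read an integer in the range 0..9999 by numeric value."""
    if n == 0:
        return DIGITS[0]
    s = str(n)
    out = []
    pending_zero = False  # a skipped zero that must be voiced before the next digit
    for i, ch in enumerate(s):
        d = int(ch)
        if d == 0:
            pending_zero = True
            continue
        if pending_zero:
            out.append(DIGITS[0])
            pending_zero = False
        out.append(DIGITS[d] + UNITS[len(s) - 1 - i])
    text = "".join(out)
    # A leading "一十" is conventionally shortened to "十".
    return text[1:] if text.startswith("一十") else text


print(read_digits("314"))  # 三一四
print(read_value(314))     # 三百一十四
```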
#### convert
Text conversion.
Full-width/half-width conversion, and Simplified/Traditional Chinese conversion.
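The width conversion follows from the layout of Unicode: printable ASCII characters (U+0021 to U+007E) have full-width forms at a fixed offset of 0xFEE0 (65248), and the space maps to the ideographic space U+3000. The sketch below uses the same `str.translate` approach as the convert module in this commit; the function and table names are chosen for this example only.

```python
# Code-point tables for str.translate.
HALF_TO_FULL = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}
HALF_TO_FULL[0x20] = 0x3000  # space -> ideographic space
FULL_TO_HALF = {full: half for half, full in HALF_TO_FULL.items()}


def to_full_width(text):
    return text.translate(HALF_TO_FULL)


def to_half_width(text):
    return text.translate(FULL_TO_HALF)


print(to_half_width(to_full_width("aA1 ,")) == "aA1 ,")  # True
```

Characters outside the tables, such as 。 and 、, pass through unchanged in both directions.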
#### style
Pinyin format conversion.
Converts between national-standard (tone-mark) pinyin and letter-plus-digit pinyin.
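The style module in this commit performs the conversion with a precomputed lookup table. As an alternative sketch of the same idea, the tone mark can be detected through Unicode decomposition with the standard `unicodedata` module. The name `marked_to_numbered`, the use of tone 5 for unmarked syllables, and the replacement of ü by `v` (chosen to match keys such as `lv` in `shengyun2ph_dict`) are assumptions of this example.

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def marked_to_numbered(py):
    """Convert tone-mark pinyin such as 'bǎi' to letter-plus-digit form 'bai3'."""
    tone = "5"  # assumption: no tone mark means neutral tone
    letters = []
    for ch in unicodedata.normalize("NFD", py):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    base = unicodedata.normalize("NFC", "".join(letters))
    return base.replace("\u00fc", "v") + tone  # ü is written as v


print(marked_to_numbered("bǎi"))  # bai3
```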
### english
From https://github.com/keithito/tacotron:
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
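To make the idea of a cleaner concrete, here is a minimal sketch in the spirit of a basic cleaner: it lowercases the text and collapses runs of whitespace, without any transliteration. The function name `basic_cleaner` is chosen for this example and is not claimed to be the exact code of the upstream project.

```python
import re

_whitespace_re = re.compile(r"\s+")


def basic_cleaner(text):
    """Lowercase the text and collapse consecutive whitespace into one space."""
    return _whitespace_re.sub(" ", text.lower())


print(basic_cleaner("Hello \t World"))  # hello world
```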
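### Version history
#### v0.2.8
- Text-to-pinyin: the neutral tone is represented by tone number 5.
- Text-to-pinyin: text and pinyin are guaranteed to correspond one-to-one, so the text length equals the length of the pinyin list.
- Added pinyin format conversion between the national-standard format and the letter-plus-digit format.
#### v0.2.7
- All Chinese phonemes can now be mapped.
#### v0.2.5
- Fixed a potential bug in pinyin-to-phoneme conversion.
#### v0.2.4
- Corrected several default pinyin readings.
#### v0.2.3
- Made Chinese-character-to-pinyin conversion more lightweight.
- Removed from the phrase pinyin dictionary all phrases whose readings are entirely default pinyin.
#### v0.2.2
- Fixed an error when installing dependencies.
#### v0.2.1
- Added a Chinese text_to_sequence method that can replace the English version in Chinese environments.
- Compatibility with versions before v0.1.0 requires Python 3.7 or later; otherwise, import modules from phkit.chinese instead.
#### v0.2.0
- Added the text-to-pinyin module, which depends on python-pinyin, jieba, and phrase-pinyin-data.
- Moved the Chinese phoneme scheme into the chinese module.
#### v0.1.0
- Added the English phoneme scheme, including English letters and English phonemes.
- Added a simple number-to-Chinese method.
#### todo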
```
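Text normalization
Number readings
Character readings
Common rule-based readings
Text to pinyin
pypinyin
National-standard and alnum conversion
Anything to phonemes
Characters
English
Chinese characters
OOV
Advanced:
Word segmentation
Named entity recognition
Dependency parsing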
```

@ -0,0 +1,115 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/17
"""
![phkit](phkit.png "phkit")
## phkit
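phoneme toolkit: a toolbox for pinyin-related text processing, providing front-end text solutions for Chinese and English speech synthesis.
#### Installation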
```
pip install -U phkit
```
"""
__version__ = "0.2.8"
version_doc = """
#### Version
v{}
""".format(__version__)
history_doc = """
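### Version history
#### v0.2.8
- Text-to-pinyin: the neutral tone is represented by tone number 5.
- Text-to-pinyin: text and pinyin are guaranteed to correspond one-to-one, so the text length equals the length of the pinyin list.
- Added pinyin format conversion between the national-standard format and the letter-plus-digit format.
#### v0.2.7
- All Chinese phonemes can now be mapped.
#### v0.2.5
- Fixed a potential bug in pinyin-to-phoneme conversion.
#### v0.2.4
- Corrected several default pinyin readings.
#### v0.2.3
- Made Chinese-character-to-pinyin conversion more lightweight.
- Removed from the phrase pinyin dictionary all phrases whose readings are entirely default pinyin.
#### v0.2.2
- Fixed an error when installing dependencies.
#### v0.2.1
- Added a Chinese text_to_sequence method that can replace the English version in Chinese environments.
- Compatibility with versions before v0.1.0 requires Python 3.7 or later; otherwise, import modules from phkit.chinese instead.
#### v0.2.0
- Added the text-to-pinyin module, which depends on python-pinyin, jieba, and phrase-pinyin-data.
- Moved the Chinese phoneme scheme into the chinese module.
#### v0.1.0
- Added the English phoneme scheme, including English letters and English phonemes.
- Added a simple number-to-Chinese method.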
#### todo
```
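Text normalization
Number readings
Character readings
Common rule-based readings
Text to pinyin
pypinyin
National-standard and alnum conversion
Anything to phonemes
Characters
English
Chinese characters
OOV
Advanced:
Word segmentation
Named entity recognition
Dependency parsing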
```
"""
from phkit.chinese import __doc__ as doc_chinese
from phkit.chinese.symbol import __doc__ as doc_symbol
from phkit.chinese.sequence import __doc__ as doc_sequence
from phkit.chinese.pinyin import __doc__ as doc_pinyin
from phkit.chinese.phoneme import __doc__ as doc_phoneme
from phkit.chinese.number import __doc__ as doc_number
from phkit.chinese.convert import __doc__ as doc_convert
from phkit.chinese.style import __doc__ as doc_style
from .english import __doc__ as doc_english
from .pinyinkit import __doc__ as doc_pinyinkit
readme_docs = [__doc__, version_doc,
doc_pinyinkit,
doc_chinese, doc_symbol, doc_sequence, doc_pinyin, doc_phoneme, doc_number, doc_convert, doc_style,
doc_english,
history_doc]
from .chinese import text_to_sequence as chinese_text_to_sequence, sequence_to_text as chinese_sequence_to_text
from .english import text_to_sequence as english_text_to_sequence, sequence_to_text as english_sequence_to_text
from .pinyinkit import lazy_pinyin, pinyin, slug, initialize
# Compatibility with versions before 0.1.0; supported on Python 3.7 and later.
from .chinese import convert, number, phoneme, sequence, symbol, style
from .chinese.style import guobiao2shengyundiao, shengyundiao2guobiao
from .chinese.convert import fan2jian, jian2fan, quan2ban, ban2quan
from .chinese.number import say_digit, say_decimal, say_number
from .chinese.pinyin import text2pinyin, split_pinyin
from .chinese.sequence import text2sequence, text2phoneme, pinyin2phoneme, phoneme2sequence, sequence2phoneme
from .chinese.sequence import symbol_chinese, ph2id_dict, id2ph_dict
if __name__ == "__main__":
    print(__file__)

@ -0,0 +1,79 @@
"""
### chinese
Phoneme set for Chinese, English, and mixed Chinese-English text. Chinese pinyin uses the Tsinghua University phoneme scheme; English characters are split into letter phonemes and English phonemes.
- Overview of the Chinese phoneme set:
```
Initials:
aa b c ch d ee f g h ii j k l m n oo p q r s sh t uu vv x z zh
Finals:
a ai an ang ao e ei en eng er i ia ian iang iao ie in ing iong iu ix iy iz o ong ou u ua uai uan uang ueng ui un uo v van ve vn ng uong
Tones:
1 2 3 4 5
Letters:
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz
English:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Punctuation:
! ? . , ; : " # ( )
Note: !=!|?=?|.=.|,=,|;=;|:=:|"="|#=#  \t|(=([{{【<《|)=)]}}】>》
Reserved:
w y 0 6 7 8 9
w=%|y=$|0=0|6=6|7=7|8=8|9=9
Other:
_ ~ - *
```
"""
from .convert import fan2jian, jian2fan, quan2ban, ban2quan
from .number import say_digit, say_decimal, say_number
from .pinyin import text2pinyin, split_pinyin
from .sequence import text2sequence, text2phoneme, pinyin2phoneme, phoneme2sequence, sequence2phoneme, change_diao
from .sequence import symbol_chinese, ph2id_dict, id2ph_dict
from .symbol import symbol_chinese as symbols
from .phoneme import shengyun2ph_dict
def text_to_sequence(src, cleaner_names=None, **kwargs):
    """
    Text example: 卡尔普陪外孙玩滑梯
    Pinyin example: ka3 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1 .
    :param src: str, a pinyin or text string
    :param cleaner_names: text-processing method; currently "pinyin" and plain text are supported
    :return: list, list of IDs
    """
    if cleaner_names == "pinyin":
        pys = []
        for py in src.split():
            if py.isalnum():
                pys.append(py)
            else:
                # Non-pinyin tokens (such as punctuation) are passed as tuples.
                pys.append((py,))
        phs = pinyin2phoneme(pys)
        phs = change_diao(phs)
        seq = phoneme2sequence(phs)
        return seq
    else:
        return text2sequence(src)


def sequence_to_text(src):
    out = sequence2phoneme(src)
    return " ".join(out)


if __name__ == "__main__":
    print(__file__)
    text = "ka3 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1 . "
    out = text_to_sequence(text)
    print(out)
    out = sequence_to_text(out)
    print(out)

@ -0,0 +1,51 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/17
"""
#### convert
Text conversion.
Full-width/half-width conversion, and Simplified/Traditional Chinese conversion.
"""
from hanziconv import hanziconv

hc = hanziconv.HanziConv()
# Traditional to Simplified
fan2jian = hc.toSimplified
# Simplified to Traditional
jian2fan = hc.toTraditional
# Half-width to full-width mapping table
ban2quan_dict = {i: i + 65248 for i in range(33, 127)}
ban2quan_dict.update({32: 12288})
# Full-width to half-width mapping table
quan2ban_dict = {v: k for k, v in ban2quan_dict.items()}


def ban2quan(text: str):
    """
    Convert half-width characters to full-width.
    :param text:
    :return:
    """
    return text.translate(ban2quan_dict)


def quan2ban(text: str):
    """
    Convert full-width characters to half-width.
    :param text:
    :return:
    """
    return text.translate(quan2ban_dict)


if __name__ == "__main__":
    assert ban2quan("aA1 ,:$。、") == "aA1\u3000,:$。、"
    assert quan2ban("aA1\u3000,:$。、") == "aA1 ,:$。、"
    assert jian2fan("中国语言") == "中國語言"
    assert fan2jian("中國語言") == "中国语言"

@ -0,0 +1,81 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### number
How numbers are read.
Read by numeric value, or read digit by digit.
"""
import re

_number_cn = ['零', '一', '二', '三', '四', '五', '六', '七', '八', '九']
_number_level = ['千', '百', '十', '万', '千', '百', '十', '亿', '千', '百', '十', '万', '千', '百', '十', '个']
_zero = _number_cn[0]
_ten_re = re.compile(r'^一十')
_grade_level = {'万', '亿', '个'}
_number_group_re = re.compile(r"([0-9]+)")


def say_digit(num: str):
    outs = []
    for zi in num:
        outs.append(_number_cn[int(zi)])
    return ''.join(outs)


def say_number(num: str):
    x = str(int(num))
    if x == '0':
        return _number_cn[0]
    elif len(x) > 16:
        return num
    length = len(x)
    outs = []
    for num, zi in enumerate(x):
        a = _number_cn[int(zi)]
        b = _number_level[len(_number_level) - length + num]
        if a != _zero:
            outs.append(a)
            outs.append(b)
        else:
            if b in _grade_level:
                if outs[-1] != _zero:
                    outs.append(b)
                else:
                    outs[-1] = b
            else:
                if outs[-1] != _zero:
                    outs.append(a)
    # Drop the trailing unit marker, then shorten a leading "一十" to "十".
    out = ''.join(outs[:-1])
    out = _ten_re.sub(r'十', out)
    return out


def say_decimal(num: str):
    z, x = num.split('.')
    z_cn = say_number(z)
    x_cn = say_digit(x)
    return z_cn + '点' + x_cn


def convert_number(text):
    parts = _number_group_re.split(text)
    outs = []
    for elem in parts:
        if elem.isdigit():
            # Short digit runs are read by value, long ones digit by digit.
            if len(elem) <= 9:
                outs.append(say_number(elem))
            else:
                outs.append(say_digit(elem))
        else:
            outs.append(elem)
    return ''.join(outs)


if __name__ == "__main__":
    print(__file__)
    assert say_number("1234567890123456") == "一千二百三十四万五千六百七十八亿九千零一十二万三千四百五十六"
    assert say_digit("123456") == "一二三四五六"
    assert say_decimal("3.14") == "三点一四"
    assert convert_number("hello314.1592and2718281828") == "hello三百一十四.一千五百九十二and二七一八二八一八二八"

@ -0,0 +1,480 @@
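#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### phoneme
Phoneme mapping tables.
Toneless pinyin to phonemes, tones to phonemes, English letters to phonemes, and punctuation to phonemes.
"""
# Pinyin-to-phoneme mapping table: 420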
shengyun2ph_dict = {
'a': 'aa a',
'ai': 'aa ai',
'an': 'aa an',
'ang': 'aa ang',
'ao': 'aa ao',
'ba': 'b a',
'bai': 'b ai',
'ban': 'b an',
'bang': 'b ang',
'bao': 'b ao',
'bei': 'b ei',
'ben': 'b en',
'beng': 'b eng',
'bi': 'b i',
'bian': 'b ian',
'biao': 'b iao',
'bie': 'b ie',
'bin': 'b in',
'bing': 'b ing',
'bo': 'b o',
'bu': 'b u',
'ca': 'c a',
'cai': 'c ai',
'can': 'c an',
'cang': 'c ang',
'cao': 'c ao',
'ce': 'c e',
'cen': 'c en',
'ceng': 'c eng',
'ci': 'c iy',
'cong': 'c ong',
'cou': 'c ou',
'cu': 'c u',
'cuan': 'c uan',
'cui': 'c ui',
'cun': 'c un',
'cuo': 'c uo',
'cha': 'ch a',
'chai': 'ch ai',
'chan': 'ch an',
'chang': 'ch ang',
'chao': 'ch ao',
'che': 'ch e',
'chen': 'ch en',
'cheng': 'ch eng',
'chi': 'ch ix',
'chong': 'ch ong',
'chou': 'ch ou',
'chu': 'ch u',
'chuai': 'ch uai',
'chuan': 'ch uan',
'chuang': 'ch uang',
'chui': 'ch ui',
'chun': 'ch un',
'chuo': 'ch uo',
'da': 'd a',
'dai': 'd ai',
'dan': 'd an',
'dang': 'd ang',
'dao': 'd ao',
'de': 'd e',
'dei': 'd ei',
'deng': 'd eng',
'di': 'd i',
'dia': 'd ia',
'dian': 'd ian',
'diao': 'd iao',
'die': 'd ie',
'ding': 'd ing',
'diu': 'd iu',
'dong': 'd ong',
'dou': 'd ou',
'du': 'd u',
'duan': 'd uan',
'dui': 'd ui',
'dun': 'd un',
'duo': 'd uo',
'e': 'ee e',
'ei': 'ee ei',
'en': 'ee en',
'er': 'ee er',
'fa': 'f a',
'fan': 'f an',
'fang': 'f ang',
'fei': 'f ei',
'fen': 'f en',
'feng': 'f eng',
'fo': 'f o',
'fou': 'f ou',
'fu': 'f u',
'ga': 'g a',
'gai': 'g ai',
'gan': 'g an',
'gang': 'g ang',
'gao': 'g ao',
'ge': 'g e',
'gei': 'g ei',
'gen': 'g en',
'geng': 'g eng',
'gong': 'g ong',
'gou': 'g ou',
'gu': 'g u',
'gua': 'g ua',
'guai': 'g uai',
'guan': 'g uan',
'guang': 'g uang',
'gui': 'g ui',
'gun': 'g un',
'guo': 'g uo',
'ha': 'h a',
'hai': 'h ai',
'han': 'h an',
'hang': 'h ang',
'hao': 'h ao',
'he': 'h e',
'hei': 'h ei',
'hen': 'h en',
'heng': 'h eng',
'hong': 'h ong',
'hou': 'h ou',
'hu': 'h u',
'hua': 'h ua',
'huai': 'h uai',
'huan': 'h uan',
'huang': 'h uang',
'hui': 'h ui',
'hun': 'h un',
'huo': 'h uo',
'yi': 'ii i',
'ya': 'ii ia',
'yan': 'ii ian',
'yang': 'ii iang',
'yao': 'ii iao',
'ye': 'ii ie',
'yin': 'ii in',
'ying': 'ii ing',
'yong': 'ii iong',
'you': 'ii iu',
'ji': 'j i',
'jia': 'j ia',
'jian': 'j ian',
'jiang': 'j iang',
'jiao': 'j iao',
'jie': 'j ie',
'jin': 'j in',
'jing': 'j ing',
'jiong': 'j iong',
'jiu': 'j iu',
'ju': 'j v',
'juan': 'j van',
'jue': 'j ve',
'jun': 'j vn',
'ka': 'k a',
'kai': 'k ai',
'kan': 'k an',
'kang': 'k ang',
'kao': 'k ao',
'ke': 'k e',
'ken': 'k en',
'keng': 'k eng',
'kong': 'k ong',
'kou': 'k ou',
'ku': 'k u',
'kua': 'k ua',
'kuai': 'k uai',
'kuan': 'k uan',
'kuang': 'k uang',
'kui': 'k ui',
'kun': 'k un',
'kuo': 'k uo',
'la': 'l a',
'lai': 'l ai',
'lan': 'l an',
'lang': 'l ang',
'lao': 'l ao',
'le': 'l e',
'lei': 'l ei',
'leng': 'l eng',
'li': 'l i',
'lia': 'l ia',
'lian': 'l ian',
'liang': 'l iang',
'liao': 'l iao',
'lie': 'l ie',
'lin': 'l in',
'ling': 'l ing',
'liu': 'l iu',
'lo': 'l o',
'long': 'l ong',
'lou': 'l ou',
'lu': 'l u',
'luan': 'l uan',
'lun': 'l un',
'luo': 'l uo',
'lv': 'l v',
'lve': 'l ve',
'ma': 'm a',
'mai': 'm ai',
'man': 'm an',
'mang': 'm ang',
'mao': 'm ao',
'me': 'm e',
'mei': 'm ei',
'men': 'm en',
'meng': 'm eng',
'mi': 'm i',
'mian': 'm ian',
'miao': 'm iao',
'mie': 'm ie',
'min': 'm in',
'ming': 'm ing',
'miu': 'm iu',
'mo': 'm o',
'mou': 'm ou',
'mu': 'm u',
'na': 'n a',
'nai': 'n ai',
'nan': 'n an',
'nang': 'n ang',
'nao': 'n ao',
'ne': 'n e',
'nei': 'n ei',
'nen': 'n en',
'neng': 'n eng',
'ni': 'n i',
'nian': 'n ian',
'niang': 'n iang',
'niao': 'n iao',
'nie': 'n ie',
'nin': 'n in',
'ning': 'n ing',
'niu': 'n iu',
'nong': 'n ong',
'nu': 'n u',
'nuan': 'n uan',
'nuo': 'n uo',
'nv': 'n v',
'nve': 'n ve',
'o': 'oo o',
'ou': 'oo ou',
'pa': 'p a',
'pai': 'p ai',
'pan': 'p an',
'pang': 'p ang',
'pao': 'p ao',
'pei': 'p ei',
'pen': 'p en',
'peng': 'p eng',
'pi': 'p i',
'pian': 'p ian',
'piao': 'p iao',
'pie': 'p ie',
'pin': 'p in',
'ping': 'p ing',
'po': 'p o',
'pou': 'p ou',
'pu': 'p u',
'qi': 'q i',
'qia': 'q ia',
'qian': 'q ian',
'qiang': 'q iang',
'qiao': 'q iao',
'qie': 'q ie',
'qin': 'q in',
'qing': 'q ing',
'qiong': 'q iong',
'qiu': 'q iu',
'qu': 'q v',
'quan': 'q van',
'que': 'q ve',
'qun': 'q vn',
'ran': 'r an',
'rang': 'r ang',
'rao': 'r ao',
're': 'r e',
'ren': 'r en',
'reng': 'r eng',
'ri': 'r iz',
'rong': 'r ong',
'rou': 'r ou',
'ru': 'r u',
'ruan': 'r uan',
'rui': 'r ui',
'run': 'r un',
'ruo': 'r uo',
'sa': 's a',
'sai': 's ai',
'san': 's an',
'sang': 's ang',
'sao': 's ao',
'se': 's e',
'sen': 's en',
'seng': 's eng',
'si': 's iy',
'song': 's ong',
'sou': 's ou',
'su': 's u',
'suan': 's uan',
'sui': 's ui',
'sun': 's un',
'suo': 's uo',
'sha': 'sh a',
'shai': 'sh ai',
'shan': 'sh an',
'shang': 'sh ang',
'shao': 'sh ao',
'she': 'sh e',
'shei': 'sh ei',
'shen': 'sh en',
'sheng': 'sh eng',
'shi': 'sh ix',
'shou': 'sh ou',
'shu': 'sh u',
'shua': 'sh ua',
'shuai': 'sh uai',
'shuan': 'sh uan',
'shuang': 'sh uang',
'shui': 'sh ui',
'shun': 'sh un',
'shuo': 'sh uo',
'ta': 't a',
'tai': 't ai',
'tan': 't an',
'tang': 't ang',
'tao': 't ao',
'te': 't e',
'teng': 't eng',
'ti': 't i',
'tian': 't ian',
'tiao': 't iao',
'tie': 't ie',
'ting': 't ing',
'tong': 't ong',
'tou': 't ou',
'tu': 't u',
'tuan': 't uan',
'tui': 't ui',
'tun': 't un',
'tuo': 't uo',
'wu': 'uu u',
'wa': 'uu ua',
'wai': 'uu uai',
'wan': 'uu uan',
'wang': 'uu uang',
'weng': 'uu ueng',
'wei': 'uu ui',
'wen': 'uu un',
'wo': 'uu uo',
'yu': 'vv v',
'yuan': 'vv van',
'yue': 'vv ve',
'yun': 'vv vn',
'xi': 'x i',
'xia': 'x ia',
'xian': 'x ian',
'xiang': 'x iang',
'xiao': 'x iao',
'xie': 'x ie',
'xin': 'x in',
'xing': 'x ing',
'xiong': 'x iong',
'xiu': 'x iu',
'xu': 'x v',
'xuan': 'x van',
'xue': 'x ve',
'xun': 'x vn',
'za': 'z a',
'zai': 'z ai',
'zan': 'z an',
'zang': 'z ang',
'zao': 'z ao',
'ze': 'z e',
'zei': 'z ei',
'zen': 'z en',
'zeng': 'z eng',
'zi': 'z iy',
'zong': 'z ong',
'zou': 'z ou',
'zu': 'z u',
'zuan': 'z uan',
'zui': 'z ui',
'zun': 'z un',
'zuo': 'z uo',
'zha': 'zh a',
'zhai': 'zh ai',
'zhan': 'zh an',
'zhang': 'zh ang',
'zhao': 'zh ao',
'zhe': 'zh e',
'zhei': 'zh ei',
'zhen': 'zh en',
'zheng': 'zh eng',
'zhi': 'zh ix',
'zhong': 'zh ong',
'zhou': 'zh ou',
'zhu': 'zh u',
'zhua': 'zh ua',
'zhuai': 'zh uai',
'zhuan': 'zh uan',
'zhuang': 'zh uang',
'zhui': 'zh ui',
'zhun': 'zh un',
'zhuo': 'zh uo',
'cei': 'c ei',
'chua': 'ch ua',
'den': 'd en',
'din': 'd in',
'eng': 'ee eng',
'ng': 'ee ng',
'fiao': 'f iao',
'yo': 'ii o',
'kei': 'k ei',
'len': 'l en',
'nia': 'n ia',
'nou': 'n ou',
'nun': 'n un',
'rua': 'r ua',
'tei': 't ei',
'wong': 'uu uong',
'n': 'n ng'
}
diao2ph_dict = {'1': '1', '2': '2', '3': '3', '4': '4', '5': '5'}
# Letter phonemes: 26
_alphabet = 'Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz'.split()
# Letters: 26
_upper = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
_lower = list('abcdefghijklmnopqrstuvwxyz')
upper2ph_dict = dict(zip(_upper, _alphabet))
lower2ph_dict = dict(zip(_lower, _upper))
# Punctuation: 10
_biaodian = '! ? . , ; : " # ( )'.split()
# Note: !=!|?=?|.=.。|,=,,、|;=;|:=:|"="“”'|#=  \t|(=([{{【<《|)=)]}}】>》
biao2ph_dict = {
'!': '!', '!': '!',
'?': '?', '?': '?',
'.': '.', '。': '.',
',': ',', ',': ',', '、': ',',
';': ';', ';': ';',
':': ':', ':': ':',
'"': '"', '“': '"', '”': '"', "'": '"', '‘': '"', '’': '"',
'#': '#', '#': '#', ' ': '#', '\u3000': '#', '\t': '#',
'(': '(', '(': '(', '[': '(', '[': '(', '{': '(', '{': '(', '【': '(', '<': '(', '《': '(',
')': ')', ')': ')', ']': ')', ']': ')', '}': ')', '}': ')', '】': ')', '>': ')', '》': ')'
}
# Other: 7
_other = 'w y 0 6 7 8 9'.split()
other2ph_dict = {
'%': 'w',
'$': 'y',
'0': '0',
'6': '6',
'7': '7',
'8': '8',
'9': '9'
}
char2ph_dict = {**upper2ph_dict, **lower2ph_dict, **biao2ph_dict, **other2ph_dict}
if __name__ == "__main__":
    print(__file__)

@ -0,0 +1,11 @@
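#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/17
"""
#### pinyin
Methods for converting to pinyin: Chinese characters to pinyin, and tone separation.
Pinyin is written as letters plus a tone digit, for example pin1.
"""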
from ..pinyinkit import text2pinyin, split_pinyin

@ -0,0 +1,153 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### sequence
Methods for converting to sequences: text to a phoneme list, and text to an ID list.
Pinyin tone sandhi, and pinyin to phonemes.
"""
from .phoneme import shengyun2ph_dict, diao2ph_dict, char2ph_dict
from .pinyin import text2pinyin, split_pinyin
from .symbol import _chain, _eos, _pad, symbol_chinese
from .convert import fan2jian, quan2ban
from .number import convert_number
import re
# Split out runs of English letters
_en_re = re.compile(r"([a-zA-Z]+)")
phs = ({w for p in shengyun2ph_dict.values() for w in p.split()}
| set(diao2ph_dict.values()) | set(char2ph_dict.values()))
assert bool(phs - set(symbol_chinese)) is False
ph2id_dict = {p: i for i, p in enumerate(symbol_chinese)}
id2ph_dict = {i: p for i, p in enumerate(symbol_chinese)}
assert len(ph2id_dict) == len(id2ph_dict)
def text2phoneme(text):
    """
    Convert text to phonemes using the Chinese phoneme scheme.
    Chinese is converted to pinyin and then to phonemes following the Tsinghua University scheme, split into consonant, vowel, and tone.
    English written entirely in uppercase is read letter by letter.
    English not written entirely in uppercase is read as English.
    Punctuation is mapped to phonemes.
    :param text: str, normalized text
    :return: list, list of phonemes
    """
    text = normalize_chinese(text)
    text = normalize_english(text)
    pys = text2pinyin(text, errors=lambda x: (x,))
    phs = pinyin2phoneme(pys)
    phs = change_diao(phs)
    return phs


def text2sequence(text):
    """
    Convert text to an ID sequence.
    :param text:
    :return:
    """
    phs = text2phoneme(text)
    seq = phoneme2sequence(phs)
    return seq


def pinyin2phoneme(src):
    """
    Convert pinyin or other characters to phonemes.
    :param src: list; pinyin items are str, other items are tuple
    :return: list
    """
    out = []
    for py in src:
        if type(py) is str:
            fuyuan, diao = split_pinyin(py)
            if fuyuan in shengyun2ph_dict and diao in diao2ph_dict:
                phs = shengyun2ph_dict[fuyuan].split()
                phs.append(diao2ph_dict[diao])
            else:
                phs = py_errors(py)
        else:
            phs = []
            for w in py:
                ph = py_errors(w)
                phs.extend(ph)
        if phs:
            out.extend(phs)
            out.append(_chain)
    out.append(_eos)
    out.append(_pad)
    return out


def change_diao(src):
    """
    Apply pinyin tone sandhi: for consecutive third tones, change the preceding third tone to the second tone.
    :param src: list, list of phonemes
    :return: list, list of phonemes after tone sandhi
    """
    flag = -5
    out = []
    for i, w in enumerate(reversed(src)):
        if w == '3':
            # Tones of adjacent syllables sit 4 tokens apart (initial, final, tone, chain).
            if i - flag == 4:
                out.append('2')
            else:
                flag = i
                out.append(w)
        else:
            out.append(w)
    return list(reversed(out))


def phoneme2sequence(src):
    out = []
    for w in src:
        if w in ph2id_dict:
            out.append(ph2id_dict[w])
    return out


def sequence2phoneme(src):
    out = []
    for w in src:
        if w in id2ph_dict:
            out.append(id2ph_dict[w])
    return out


def py_errors(text):
    out = []
    for p in text:
        if p in char2ph_dict:
            out.append(char2ph_dict[p])
    return out


def normalize_chinese(text):
    text = quan2ban(text)
    text = fan2jian(text)
    text = convert_number(text)
    return text


def normalize_english(text):
    out = []
    parts = _en_re.split(text)
    for part in parts:
        if not part.isupper():
            out.append(part.lower())
        else:
            out.append(part)
    return "".join(out)


if __name__ == "__main__":
    print(__file__)

@ -0,0 +1,339 @@
# author: kuangdd
# date: 2021/5/8
"""
#### style
Pinyin format conversion.
Converts between national-standard (tone-mark) pinyin and letter-plus-digit pinyin.
"""
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(Path(__file__).stem)
# 2100 = 420 * 5
guobiao2shengyundiao_dict = {
'a': 'a5', 'ā': 'a1', 'á': 'a2', 'ǎ': 'a3', 'à': 'a4', 'ai': 'ai5', 'āi': 'ai1', 'ái': 'ai2', 'ǎi': 'ai3',
'ài': 'ai4', 'an': 'an5', 'ān': 'an1', 'án': 'an2', 'ǎn': 'an3', 'àn': 'an4', 'ang': 'ang5', 'āng': 'ang1',
'áng': 'ang2', 'ǎng': 'ang3', 'àng': 'ang4', 'ao': 'ao5', 'āo': 'ao1', 'áo': 'ao2', 'ǎo': 'ao3', 'ào': 'ao4',
'ba': 'ba5', 'bā': 'ba1', 'bá': 'ba2', 'bǎ': 'ba3', 'bà': 'ba4', 'bai': 'bai5', 'bāi': 'bai1', 'bái': 'bai2',
'bǎi': 'bai3', 'bài': 'bai4', 'ban': 'ban5', 'bān': 'ban1', 'bán': 'ban2', 'bǎn': 'ban3', 'bàn': 'ban4',
'bang': 'bang5', 'bāng': 'bang1', 'báng': 'bang2', 'bǎng': 'bang3', 'bàng': 'bang4', 'bao': 'bao5', 'bāo': 'bao1',
'báo': 'bao2', 'bǎo': 'bao3', 'bào': 'bao4', 'bei': 'bei5', 'bēi': 'bei1', 'béi': 'bei2', 'běi': 'bei3',
'bèi': 'bei4', 'ben': 'ben5', 'bēn': 'ben1', 'bén': 'ben2', 'běn': 'ben3', 'bèn': 'ben4', 'beng': 'beng5',
'bēng': 'beng1', 'béng': 'beng2', 'běng': 'beng3', 'bèng': 'beng4', 'bi': 'bi5', 'bī': 'bi1', 'bí': 'bi2',
'bǐ': 'bi3', 'bì': 'bi4', 'bian': 'bian5', 'biān': 'bian1', 'bián': 'bian2', 'biǎn': 'bian3', 'biàn': 'bian4',
'biao': 'biao5', 'biāo': 'biao1', 'biáo': 'biao2', 'biǎo': 'biao3', 'biào': 'biao4', 'bie': 'bie5', 'biē': 'bie1',
'bié': 'bie2', 'biě': 'bie3', 'biè': 'bie4', 'bin': 'bin5', 'bīn': 'bin1', 'bín': 'bin2', 'bǐn': 'bin3',
'bìn': 'bin4', 'bing': 'bing5', 'bīng': 'bing1', 'bíng': 'bing2', 'bǐng': 'bing3', 'bìng': 'bing4', 'bo': 'bo5',
'bō': 'bo1', 'bó': 'bo2', 'bǒ': 'bo3', 'bò': 'bo4', 'bu': 'bu5', 'bū': 'bu1', 'bú': 'bu2', 'bǔ': 'bu3', 'bù': 'bu4',
'ca': 'ca5', 'cā': 'ca1', 'cá': 'ca2', 'cǎ': 'ca3', 'cà': 'ca4', 'cai': 'cai5', 'cāi': 'cai1', 'cái': 'cai2',
'cǎi': 'cai3', 'cài': 'cai4', 'can': 'can5', 'cān': 'can1', 'cán': 'can2', 'cǎn': 'can3', 'càn': 'can4',
'cang': 'cang5', 'cāng': 'cang1', 'cáng': 'cang2', 'cǎng': 'cang3', 'càng': 'cang4', 'cao': 'cao5', 'cāo': 'cao1',
'cáo': 'cao2', 'cǎo': 'cao3', 'cào': 'cao4', 'ce': 'ce5', 'cē': 'ce1', 'cé': 'ce2', 'cě': 'ce3', 'cè': 'ce4',
'cen': 'cen5', 'cēn': 'cen1', 'cén': 'cen2', 'cěn': 'cen3', 'cèn': 'cen4', 'ceng': 'ceng5', 'cēng': 'ceng1',
'céng': 'ceng2', 'cěng': 'ceng3', 'cèng': 'ceng4', 'cha': 'cha5', 'chā': 'cha1', 'chá': 'cha2', 'chǎ': 'cha3',
'chà': 'cha4', 'chai': 'chai5', 'chāi': 'chai1', 'chái': 'chai2', 'chǎi': 'chai3', 'chài': 'chai4', 'chan': 'chan5',
'chān': 'chan1', 'chán': 'chan2', 'chǎn': 'chan3', 'chàn': 'chan4', 'chang': 'chang5', 'chāng': 'chang1',
'cháng': 'chang2', 'chǎng': 'chang3', 'chàng': 'chang4', 'chao': 'chao5', 'chāo': 'chao1', 'cháo': 'chao2',
'chǎo': 'chao3', 'chào': 'chao4', 'che': 'che5', 'chē': 'che1', 'ché': 'che2', 'chě': 'che3', 'chè': 'che4',
'chen': 'chen5', 'chēn': 'chen1', 'chén': 'chen2', 'chěn': 'chen3', 'chèn': 'chen4', 'cheng': 'cheng5',
'chēng': 'cheng1', 'chéng': 'cheng2', 'chěng': 'cheng3', 'chèng': 'cheng4', 'chi': 'chi5', 'chī': 'chi1',
'chí': 'chi2', 'chǐ': 'chi3', 'chì': 'chi4', 'chong': 'chong5', 'chōng': 'chong1', 'chóng': 'chong2',
'chǒng': 'chong3', 'chòng': 'chong4', 'chou': 'chou5', 'chōu': 'chou1', 'chóu': 'chou2', 'chǒu': 'chou3',
'chòu': 'chou4', 'chu': 'chu5', 'chū': 'chu1', 'chú': 'chu2', 'chǔ': 'chu3', 'chù': 'chu4', 'chuai': 'chuai5',
'chuāi': 'chuai1', 'chuái': 'chuai2', 'chuǎi': 'chuai3', 'chuài': 'chuai4', 'chuan': 'chuan5', 'chuān': 'chuan1',
'chuán': 'chuan2', 'chuǎn': 'chuan3', 'chuàn': 'chuan4', 'chuang': 'chuang5', 'chuāng': 'chuang1',
'chuáng': 'chuang2', 'chuǎng': 'chuang3', 'chuàng': 'chuang4', 'chui': 'chui5', 'chuī': 'chui1', 'chuí': 'chui2',
'chuǐ': 'chui3', 'chuì': 'chui4', 'chun': 'chun5', 'chūn': 'chun1', 'chún': 'chun2', 'chǔn': 'chun3',
'chùn': 'chun4', 'chuo': 'chuo5', 'chuō': 'chuo1', 'chuó': 'chuo2', 'chuǒ': 'chuo3', 'chuò': 'chuo4', 'ci': 'ci5',
'cī': 'ci1', 'cí': 'ci2', 'cǐ': 'ci3', 'cì': 'ci4', 'cong': 'cong5', 'cōng': 'cong1', 'cóng': 'cong2',
'cǒng': 'cong3', 'còng': 'cong4', 'cou': 'cou5', 'cōu': 'cou1', 'cóu': 'cou2', 'cǒu': 'cou3', 'còu': 'cou4',
'cu': 'cu5', 'cū': 'cu1', 'cú': 'cu2', 'cǔ': 'cu3', 'cù': 'cu4', 'cuan': 'cuan5', 'cuān': 'cuan1', 'cuán': 'cuan2',
'cuǎn': 'cuan3', 'cuàn': 'cuan4', 'cui': 'cui5', 'cuī': 'cui1', 'cuí': 'cui2', 'cuǐ': 'cui3', 'cuì': 'cui4',
'cun': 'cun5', 'cūn': 'cun1', 'cún': 'cun2', 'cǔn': 'cun3', 'cùn': 'cun4', 'cuo': 'cuo5', 'cuō': 'cuo1',
'cuó': 'cuo2', 'cuǒ': 'cuo3', 'cuò': 'cuo4', 'da': 'da5', 'dā': 'da1', 'dá': 'da2', 'dǎ': 'da3', 'dà': 'da4',
'dai': 'dai5', 'dāi': 'dai1', 'dái': 'dai2', 'dǎi': 'dai3', 'dài': 'dai4', 'dan': 'dan5', 'dān': 'dan1',
'dán': 'dan2', 'dǎn': 'dan3', 'dàn': 'dan4', 'dang': 'dang5', 'dāng': 'dang1', 'dáng': 'dang2', 'dǎng': 'dang3',
'dàng': 'dang4', 'dao': 'dao5', 'dāo': 'dao1', 'dáo': 'dao2', 'dǎo': 'dao3', 'dào': 'dao4', 'de': 'de5',
'dē': 'de1', 'dé': 'de2', 'dě': 'de3', 'dè': 'de4', 'dei': 'dei5', 'dēi': 'dei1', 'déi': 'dei2', 'děi': 'dei3',
'dèi': 'dei4', 'den': 'den5', 'dēn': 'den1', 'dén': 'den2', 'děn': 'den3', 'dèn': 'den4', 'deng': 'deng5',
'dēng': 'deng1', 'déng': 'deng2', 'děng': 'deng3', 'dèng': 'deng4', 'di': 'di5', 'dī': 'di1', 'dí': 'di2',
'dǐ': 'di3', 'dì': 'di4', 'dia': 'dia5', 'diā': 'dia1', 'diá': 'dia2', 'diǎ': 'dia3', 'dià': 'dia4',
'dian': 'dian5', 'diān': 'dian1', 'dián': 'dian2', 'diǎn': 'dian3', 'diàn': 'dian4', 'diao': 'diao5',
'diāo': 'diao1', 'diáo': 'diao2', 'diǎo': 'diao3', 'diào': 'diao4', 'die': 'die5', 'diē': 'die1', 'dié': 'die2',
'diě': 'die3', 'diè': 'die4', 'ding': 'ding5', 'dīng': 'ding1', 'díng': 'ding2', 'dǐng': 'ding3', 'dìng': 'ding4',
'diu': 'diu5', 'diū': 'diu1', 'diú': 'diu2', 'diǔ': 'diu3', 'diù': 'diu4', 'dong': 'dong5', 'dōng': 'dong1',
'dóng': 'dong2', 'dǒng': 'dong3', 'dòng': 'dong4', 'dou': 'dou5', 'dōu': 'dou1', 'dóu': 'dou2', 'dǒu': 'dou3',
'dòu': 'dou4', 'du': 'du5', 'dū': 'du1', 'dú': 'du2', 'dǔ': 'du3', 'dù': 'du4', 'duan': 'duan5', 'duān': 'duan1',
'duán': 'duan2', 'duǎn': 'duan3', 'duàn': 'duan4', 'dui': 'dui5', 'duī': 'dui1', 'duí': 'dui2', 'duǐ': 'dui3',
'duì': 'dui4', 'dun': 'dun5', 'dūn': 'dun1', 'dún': 'dun2', 'dǔn': 'dun3', 'dùn': 'dun4', 'duo': 'duo5',
'duō': 'duo1', 'duó': 'duo2', 'duǒ': 'duo3', 'duò': 'duo4', 'e': 'e5', 'ē': 'e1', 'é': 'e2', 'ě': 'e3', 'è': 'e4',
'ei': 'ei5', 'ēi': 'ei1', 'éi': 'ei2', 'ěi': 'ei3', 'èi': 'ei4', 'en': 'en5', 'ēn': 'en1', 'én': 'en2', 'ěn': 'en3',
'èn': 'en4', 'eng': 'eng5', 'ēng': 'eng1', 'éng': 'eng2', 'ěng': 'eng3', 'èng': 'eng4', 'er': 'er5', 'ēr': 'er1',
'ér': 'er2', 'ěr': 'er3', 'èr': 'er4', 'fa': 'fa5', 'fā': 'fa1', 'fá': 'fa2', 'fǎ': 'fa3', 'fà': 'fa4',
'fan': 'fan5', 'fān': 'fan1', 'fán': 'fan2', 'fǎn': 'fan3', 'fàn': 'fan4', 'fang': 'fang5', 'fāng': 'fang1',
'fáng': 'fang2', 'fǎng': 'fang3', 'fàng': 'fang4', 'fei': 'fei5', 'fēi': 'fei1', 'féi': 'fei2', 'fěi': 'fei3',
'fèi': 'fei4', 'fen': 'fen5', 'fēn': 'fen1', 'fén': 'fen2', 'fěn': 'fen3', 'fèn': 'fen4', 'feng': 'feng5',
'fēng': 'feng1', 'féng': 'feng2', 'fěng': 'feng3', 'fèng': 'feng4', 'fo': 'fo5', 'fō': 'fo1', 'fó': 'fo2',
'fǒ': 'fo3', 'fò': 'fo4', 'fou': 'fou5', 'fōu': 'fou1', 'fóu': 'fou2', 'fǒu': 'fou3', 'fòu': 'fou4', 'fu': 'fu5',
'fū': 'fu1', 'fú': 'fu2', 'fǔ': 'fu3', 'fù': 'fu4', 'ga': 'ga5', 'gā': 'ga1', 'gá': 'ga2', 'gǎ': 'ga3', 'gà': 'ga4',
'gai': 'gai5', 'gāi': 'gai1', 'gái': 'gai2', 'gǎi': 'gai3', 'gài': 'gai4', 'gan': 'gan5', 'gān': 'gan1',
'gán': 'gan2', 'gǎn': 'gan3', 'gàn': 'gan4', 'gang': 'gang5', 'gāng': 'gang1', 'gáng': 'gang2', 'gǎng': 'gang3',
'gàng': 'gang4', 'gao': 'gao5', 'gāo': 'gao1', 'gáo': 'gao2', 'gǎo': 'gao3', 'gào': 'gao4', 'ge': 'ge5',
'gē': 'ge1', 'gé': 'ge2', 'gě': 'ge3', 'gè': 'ge4', 'gei': 'gei5', 'gēi': 'gei1', 'géi': 'gei2', 'gěi': 'gei3',
'gèi': 'gei4', 'gen': 'gen5', 'gēn': 'gen1', 'gén': 'gen2', 'gěn': 'gen3', 'gèn': 'gen4', 'geng': 'geng5',
'gēng': 'geng1', 'géng': 'geng2', 'gěng': 'geng3', 'gèng': 'geng4', 'gong': 'gong5', 'gōng': 'gong1',
'góng': 'gong2', 'gǒng': 'gong3', 'gòng': 'gong4', 'gou': 'gou5', 'gōu': 'gou1', 'góu': 'gou2', 'gǒu': 'gou3',
'gòu': 'gou4', 'gu': 'gu5', 'gū': 'gu1', 'gú': 'gu2', 'gǔ': 'gu3', 'gù': 'gu4', 'gua': 'gua5', 'guā': 'gua1',
'guá': 'gua2', 'guǎ': 'gua3', 'guà': 'gua4', 'guai': 'guai5', 'guāi': 'guai1', 'guái': 'guai2', 'guǎi': 'guai3',
'guài': 'guai4', 'guan': 'guan5', 'guān': 'guan1', 'guán': 'guan2', 'guǎn': 'guan3', 'guàn': 'guan4',
'guang': 'guang5', 'guāng': 'guang1', 'guáng': 'guang2', 'guǎng': 'guang3', 'guàng': 'guang4', 'gui': 'gui5',
'guī': 'gui1', 'guí': 'gui2', 'guǐ': 'gui3', 'guì': 'gui4', 'gun': 'gun5', 'gūn': 'gun1', 'gún': 'gun2',
'gǔn': 'gun3', 'gùn': 'gun4', 'guo': 'guo5', 'guō': 'guo1', 'guó': 'guo2', 'guǒ': 'guo3', 'guò': 'guo4',
'ha': 'ha5', 'hā': 'ha1', 'há': 'ha2', 'hǎ': 'ha3', 'hà': 'ha4', 'hai': 'hai5', 'hāi': 'hai1', 'hái': 'hai2',
'hǎi': 'hai3', 'hài': 'hai4', 'han': 'han5', 'hān': 'han1', 'hán': 'han2', 'hǎn': 'han3', 'hàn': 'han4',
'hang': 'hang5', 'hāng': 'hang1', 'háng': 'hang2', 'hǎng': 'hang3', 'hàng': 'hang4', 'hao': 'hao5', 'hāo': 'hao1',
'háo': 'hao2', 'hǎo': 'hao3', 'hào': 'hao4', 'he': 'he5', 'hē': 'he1', 'hé': 'he2', 'hě': 'he3', 'hè': 'he4',
'hei': 'hei5', 'hēi': 'hei1', 'héi': 'hei2', 'hěi': 'hei3', 'hèi': 'hei4', 'hen': 'hen5', 'hēn': 'hen1',
'hén': 'hen2', 'hěn': 'hen3', 'hèn': 'hen4', 'heng': 'heng5', 'hēng': 'heng1', 'héng': 'heng2', 'hěng': 'heng3',
'hèng': 'heng4', 'hong': 'hong5', 'hōng': 'hong1', 'hóng': 'hong2', 'hǒng': 'hong3', 'hòng': 'hong4', 'hou': 'hou5',
'hōu': 'hou1', 'hóu': 'hou2', 'hǒu': 'hou3', 'hòu': 'hou4', 'hu': 'hu5', 'hū': 'hu1', 'hú': 'hu2', 'hǔ': 'hu3',
'hù': 'hu4', 'hua': 'hua5', 'huā': 'hua1', 'huá': 'hua2', 'huǎ': 'hua3', 'huà': 'hua4', 'huai': 'huai5',
'huāi': 'huai1', 'huái': 'huai2', 'huǎi': 'huai3', 'huài': 'huai4', 'huan': 'huan5', 'huān': 'huan1',
'huán': 'huan2', 'huǎn': 'huan3', 'huàn': 'huan4', 'huang': 'huang5', 'huāng': 'huang1', 'huáng': 'huang2',
'huǎng': 'huang3', 'huàng': 'huang4', 'hui': 'hui5', 'huī': 'hui1', 'huí': 'hui2', 'huǐ': 'hui3', 'huì': 'hui4',
'hun': 'hun5', 'hūn': 'hun1', 'hún': 'hun2', 'hǔn': 'hun3', 'hùn': 'hun4', 'huo': 'huo5', 'huō': 'huo1',
'huó': 'huo2', 'huǒ': 'huo3', 'huò': 'huo4', 'ji': 'ji5', 'jī': 'ji1', 'jí': 'ji2', 'jǐ': 'ji3', 'jì': 'ji4',
'jia': 'jia5', 'jiā': 'jia1', 'jiá': 'jia2', 'jiǎ': 'jia3', 'jià': 'jia4', 'jian': 'jian5', 'jiān': 'jian1',
'jián': 'jian2', 'jiǎn': 'jian3', 'jiàn': 'jian4', 'jiang': 'jiang5', 'jiāng': 'jiang1', 'jiáng': 'jiang2',
'jiǎng': 'jiang3', 'jiàng': 'jiang4', 'jiao': 'jiao5', 'jiāo': 'jiao1', 'jiáo': 'jiao2', 'jiǎo': 'jiao3',
'jiào': 'jiao4', 'jie': 'jie5', 'jiē': 'jie1', 'jié': 'jie2', 'jiě': 'jie3', 'jiè': 'jie4', 'jin': 'jin5',
'jīn': 'jin1', 'jín': 'jin2', 'jǐn': 'jin3', 'jìn': 'jin4', 'jing': 'jing5', 'jīng': 'jing1', 'jíng': 'jing2',
'jǐng': 'jing3', 'jìng': 'jing4', 'jiong': 'jiong5', 'jiōng': 'jiong1', 'jióng': 'jiong2', 'jiǒng': 'jiong3',
'jiòng': 'jiong4', 'jiu': 'jiu5', 'jiū': 'jiu1', 'jiú': 'jiu2', 'jiǔ': 'jiu3', 'jiù': 'jiu4', 'ju': 'ju5',
'jū': 'ju1', 'jú': 'ju2', 'jǔ': 'ju3', 'jù': 'ju4', 'juan': 'juan5', 'juān': 'juan1', 'juán': 'juan2',
'juǎn': 'juan3', 'juàn': 'juan4', 'jue': 'jue5', 'juē': 'jue1', 'jué': 'jue2', 'juě': 'jue3', 'juè': 'jue4',
'jun': 'jun5', 'jūn': 'jun1', 'jún': 'jun2', 'jǔn': 'jun3', 'jùn': 'jun4', 'ka': 'ka5', 'kā': 'ka1', 'ká': 'ka2',
'kǎ': 'ka3', 'kà': 'ka4', 'kai': 'kai5', 'kāi': 'kai1', 'kái': 'kai2', 'kǎi': 'kai3', 'kài': 'kai4', 'kan': 'kan5',
'kān': 'kan1', 'kán': 'kan2', 'kǎn': 'kan3', 'kàn': 'kan4', 'kang': 'kang5', 'kāng': 'kang1', 'káng': 'kang2',
'kǎng': 'kang3', 'kàng': 'kang4', 'kao': 'kao5', 'kāo': 'kao1', 'káo': 'kao2', 'kǎo': 'kao3', 'kào': 'kao4',
'ke': 'ke5', 'kē': 'ke1', 'ké': 'ke2', 'kě': 'ke3', 'kè': 'ke4', 'ken': 'ken5', 'kēn': 'ken1', 'kén': 'ken2',
'kěn': 'ken3', 'kèn': 'ken4', 'keng': 'keng5', 'kēng': 'keng1', 'kéng': 'keng2', 'kěng': 'keng3', 'kèng': 'keng4',
'kong': 'kong5', 'kōng': 'kong1', 'kóng': 'kong2', 'kǒng': 'kong3', 'kòng': 'kong4', 'kou': 'kou5', 'kōu': 'kou1',
'kóu': 'kou2', 'kǒu': 'kou3', 'kòu': 'kou4', 'ku': 'ku5', 'kū': 'ku1', 'kú': 'ku2', 'kǔ': 'ku3', 'kù': 'ku4',
'kua': 'kua5', 'kuā': 'kua1', 'kuá': 'kua2', 'kuǎ': 'kua3', 'kuà': 'kua4', 'kuai': 'kuai5', 'kuāi': 'kuai1',
'kuái': 'kuai2', 'kuǎi': 'kuai3', 'kuài': 'kuai4', 'kuan': 'kuan5', 'kuān': 'kuan1', 'kuán': 'kuan2',
'kuǎn': 'kuan3', 'kuàn': 'kuan4', 'kuang': 'kuang5', 'kuāng': 'kuang1', 'kuáng': 'kuang2', 'kuǎng': 'kuang3',
'kuàng': 'kuang4', 'kui': 'kui5', 'kuī': 'kui1', 'kuí': 'kui2', 'kuǐ': 'kui3', 'kuì': 'kui4', 'kun': 'kun5',
'kūn': 'kun1', 'kún': 'kun2', 'kǔn': 'kun3', 'kùn': 'kun4', 'kuo': 'kuo5', 'kuō': 'kuo1', 'kuó': 'kuo2',
'kuǒ': 'kuo3', 'kuò': 'kuo4', 'la': 'la5', 'lā': 'la1', 'lá': 'la2', 'lǎ': 'la3', 'là': 'la4', 'lai': 'lai5',
'lāi': 'lai1', 'lái': 'lai2', 'lǎi': 'lai3', 'lài': 'lai4', 'lan': 'lan5', 'lān': 'lan1', 'lán': 'lan2',
'lǎn': 'lan3', 'làn': 'lan4', 'lang': 'lang5', 'lāng': 'lang1', 'láng': 'lang2', 'lǎng': 'lang3', 'làng': 'lang4',
'lao': 'lao5', 'lāo': 'lao1', 'láo': 'lao2', 'lǎo': 'lao3', 'lào': 'lao4', 'le': 'le5', 'lē': 'le1', 'lé': 'le2',
'lě': 'le3', 'lè': 'le4', 'lei': 'lei5', 'lēi': 'lei1', 'léi': 'lei2', 'lěi': 'lei3', 'lèi': 'lei4',
'leng': 'leng5', 'lēng': 'leng1', 'léng': 'leng2', 'lěng': 'leng3', 'lèng': 'leng4', 'li': 'li5', 'lī': 'li1',
'lí': 'li2', 'lǐ': 'li3', 'lì': 'li4', 'lia': 'lia5', 'liā': 'lia1', 'liá': 'lia2', 'liǎ': 'lia3', 'lià': 'lia4',
'lian': 'lian5', 'liān': 'lian1', 'lián': 'lian2', 'liǎn': 'lian3', 'liàn': 'lian4', 'liang': 'liang5',
'liāng': 'liang1', 'liáng': 'liang2', 'liǎng': 'liang3', 'liàng': 'liang4', 'liao': 'liao5', 'liāo': 'liao1',
'liáo': 'liao2', 'liǎo': 'liao3', 'liào': 'liao4', 'lie': 'lie5', 'liē': 'lie1', 'lié': 'lie2', 'liě': 'lie3',
'liè': 'lie4', 'lin': 'lin5', 'līn': 'lin1', 'lín': 'lin2', 'lǐn': 'lin3', 'lìn': 'lin4', 'ling': 'ling5',
'līng': 'ling1', 'líng': 'ling2', 'lǐng': 'ling3', 'lìng': 'ling4', 'liu': 'liu5', 'liū': 'liu1', 'liú': 'liu2',
'liǔ': 'liu3', 'liù': 'liu4', 'lo': 'lo5', 'lō': 'lo1', 'ló': 'lo2', 'lǒ': 'lo3', 'lò': 'lo4', 'long': 'long5',
'lōng': 'long1', 'lóng': 'long2', 'lǒng': 'long3', 'lòng': 'long4', 'lou': 'lou5', 'lōu': 'lou1', 'lóu': 'lou2',
'lǒu': 'lou3', 'lòu': 'lou4', 'lu': 'lu5', 'lū': 'lu1', 'lú': 'lu2', 'lǔ': 'lu3', 'lù': 'lu4', 'luan': 'luan5',
'luān': 'luan1', 'luán': 'luan2', 'luǎn': 'luan3', 'luàn': 'luan4', 'lun': 'lun5', 'lūn': 'lun1', 'lún': 'lun2',
'lǔn': 'lun3', 'lùn': 'lun4', 'luo': 'luo5', 'luō': 'luo1', 'luó': 'luo2', 'luǒ': 'luo3', 'luò': 'luo4',
'lü': 'lv5', 'lǖ': 'lv1', 'lǘ': 'lv2', 'lǚ': 'lv3', 'lǜ': 'lv4', 'lüe': 'lve5', 'lüē': 'lve1', 'lüé': 'lve2',
'lüě': 'lve3', 'lüè': 'lve4', 'ma': 'ma5', 'mā': 'ma1', 'má': 'ma2', 'mǎ': 'ma3', 'mà': 'ma4', 'mai': 'mai5',
'māi': 'mai1', 'mái': 'mai2', 'mǎi': 'mai3', 'mài': 'mai4', 'man': 'man5', 'mān': 'man1', 'mán': 'man2',
'mǎn': 'man3', 'màn': 'man4', 'mang': 'mang5', 'māng': 'mang1', 'máng': 'mang2', 'mǎng': 'mang3', 'màng': 'mang4',
'mao': 'mao5', 'māo': 'mao1', 'máo': 'mao2', 'mǎo': 'mao3', 'mào': 'mao4', 'me': 'me5', 'mē': 'me1', 'mé': 'me2',
'mě': 'me3', 'mè': 'me4', 'mei': 'mei5', 'mēi': 'mei1', 'méi': 'mei2', 'měi': 'mei3', 'mèi': 'mei4', 'men': 'men5',
'mēn': 'men1', 'mén': 'men2', 'měn': 'men3', 'mèn': 'men4', 'meng': 'meng5', 'mēng': 'meng1', 'méng': 'meng2',
'měng': 'meng3', 'mèng': 'meng4', 'mi': 'mi5', 'mī': 'mi1', 'mí': 'mi2', 'mǐ': 'mi3', 'mì': 'mi4', 'mian': 'mian5',
'miān': 'mian1', 'mián': 'mian2', 'miǎn': 'mian3', 'miàn': 'mian4', 'miao': 'miao5', 'miāo': 'miao1',
'miáo': 'miao2', 'miǎo': 'miao3', 'miào': 'miao4', 'mie': 'mie5', 'miē': 'mie1', 'mié': 'mie2', 'miě': 'mie3',
'miè': 'mie4', 'min': 'min5', 'mīn': 'min1', 'mín': 'min2', 'mǐn': 'min3', 'mìn': 'min4', 'ming': 'ming5',
'mīng': 'ming1', 'míng': 'ming2', 'mǐng': 'ming3', 'mìng': 'ming4', 'miu': 'miu5', 'miū': 'miu1', 'miú': 'miu2',
'miǔ': 'miu3', 'miù': 'miu4', 'mo': 'mo5', 'mō': 'mo1', 'mó': 'mo2', 'mǒ': 'mo3', 'mò': 'mo4', 'mou': 'mou5',
'mōu': 'mou1', 'móu': 'mou2', 'mǒu': 'mou3', 'mòu': 'mou4', 'mu': 'mu5', 'mū': 'mu1', 'mú': 'mu2', 'mǔ': 'mu3',
'mù': 'mu4', 'na': 'na5', 'nā': 'na1', 'ná': 'na2', 'nǎ': 'na3', 'nà': 'na4', 'nai': 'nai5', 'nāi': 'nai1',
'nái': 'nai2', 'nǎi': 'nai3', 'nài': 'nai4', 'nan': 'nan5', 'nān': 'nan1', 'nán': 'nan2', 'nǎn': 'nan3',
'nàn': 'nan4', 'nang': 'nang5', 'nāng': 'nang1', 'náng': 'nang2', 'nǎng': 'nang3', 'nàng': 'nang4', 'nao': 'nao5',
'nāo': 'nao1', 'náo': 'nao2', 'nǎo': 'nao3', 'nào': 'nao4', 'ne': 'ne5', 'nē': 'ne1', 'né': 'ne2', 'ně': 'ne3',
'nè': 'ne4', 'nei': 'nei5', 'nēi': 'nei1', 'néi': 'nei2', 'něi': 'nei3', 'nèi': 'nei4', 'nen': 'nen5',
'nēn': 'nen1', 'nén': 'nen2', 'něn': 'nen3', 'nèn': 'nen4', 'neng': 'neng5', 'nēng': 'neng1', 'néng': 'neng2',
'něng': 'neng3', 'nèng': 'neng4', 'ni': 'ni5', 'nī': 'ni1', 'ní': 'ni2', 'nǐ': 'ni3', 'nì': 'ni4', 'nian': 'nian5',
'niān': 'nian1', 'nián': 'nian2', 'niǎn': 'nian3', 'niàn': 'nian4', 'niang': 'niang5', 'niāng': 'niang1',
'niáng': 'niang2', 'niǎng': 'niang3', 'niàng': 'niang4', 'niao': 'niao5', 'niāo': 'niao1', 'niáo': 'niao2',
'niǎo': 'niao3', 'niào': 'niao4', 'nie': 'nie5', 'niē': 'nie1', 'nié': 'nie2', 'niě': 'nie3', 'niè': 'nie4',
'nin': 'nin5', 'nīn': 'nin1', 'nín': 'nin2', 'nǐn': 'nin3', 'nìn': 'nin4', 'ning': 'ning5', 'nīng': 'ning1',
'níng': 'ning2', 'nǐng': 'ning3', 'nìng': 'ning4', 'niu': 'niu5', 'niū': 'niu1', 'niú': 'niu2', 'niǔ': 'niu3',
'niù': 'niu4', 'nong': 'nong5', 'nōng': 'nong1', 'nóng': 'nong2', 'nǒng': 'nong3', 'nòng': 'nong4', 'nou': 'nou5',
'nōu': 'nou1', 'nóu': 'nou2', 'nǒu': 'nou3', 'nòu': 'nou4', 'nu': 'nu5', 'nū': 'nu1', 'nú': 'nu2', 'nǔ': 'nu3',
'nù': 'nu4', 'nuan': 'nuan5', 'nuān': 'nuan1', 'nuán': 'nuan2', 'nuǎn': 'nuan3', 'nuàn': 'nuan4', 'nuo': 'nuo5',
'nuō': 'nuo1', 'nuó': 'nuo2', 'nuǒ': 'nuo3', 'nuò': 'nuo4', 'nü': 'nv5', 'nǖ': 'nv1', 'nǘ': 'nv2', 'nǚ': 'nv3',
'nǜ': 'nv4', 'nüe': 'nve5', 'nüē': 'nve1', 'nüé': 'nve2', 'nüě': 'nve3', 'nüè': 'nve4', 'o': 'o5', 'ō': 'o1',
'ó': 'o2', 'ǒ': 'o3', 'ò': 'o4', 'ou': 'ou5', 'ōu': 'ou1', 'óu': 'ou2', 'ǒu': 'ou3', 'òu': 'ou4', 'pa': 'pa5',
'pā': 'pa1', 'pá': 'pa2', 'pǎ': 'pa3', 'pà': 'pa4', 'pai': 'pai5', 'pāi': 'pai1', 'pái': 'pai2', 'pǎi': 'pai3',
'pài': 'pai4', 'pan': 'pan5', 'pān': 'pan1', 'pán': 'pan2', 'pǎn': 'pan3', 'pàn': 'pan4', 'pang': 'pang5',
'pāng': 'pang1', 'páng': 'pang2', 'pǎng': 'pang3', 'pàng': 'pang4', 'pao': 'pao5', 'pāo': 'pao1', 'páo': 'pao2',
'pǎo': 'pao3', 'pào': 'pao4', 'pei': 'pei5', 'pēi': 'pei1', 'péi': 'pei2', 'pěi': 'pei3', 'pèi': 'pei4',
'pen': 'pen5', 'pēn': 'pen1', 'pén': 'pen2', 'pěn': 'pen3', 'pèn': 'pen4', 'peng': 'peng5', 'pēng': 'peng1',
'péng': 'peng2', 'pěng': 'peng3', 'pèng': 'peng4', 'pi': 'pi5', 'pī': 'pi1', 'pí': 'pi2', 'pǐ': 'pi3', 'pì': 'pi4',
'pian': 'pian5', 'piān': 'pian1', 'pián': 'pian2', 'piǎn': 'pian3', 'piàn': 'pian4', 'piao': 'piao5',
'piāo': 'piao1', 'piáo': 'piao2', 'piǎo': 'piao3', 'piào': 'piao4', 'pie': 'pie5', 'piē': 'pie1', 'pié': 'pie2',
'piě': 'pie3', 'piè': 'pie4', 'pin': 'pin5', 'pīn': 'pin1', 'pín': 'pin2', 'pǐn': 'pin3', 'pìn': 'pin4',
'ping': 'ping5', 'pīng': 'ping1', 'píng': 'ping2', 'pǐng': 'ping3', 'pìng': 'ping4', 'po': 'po5', 'pō': 'po1',
'pó': 'po2', 'pǒ': 'po3', 'pò': 'po4', 'pou': 'pou5', 'pōu': 'pou1', 'póu': 'pou2', 'pǒu': 'pou3', 'pòu': 'pou4',
'pu': 'pu5', 'pū': 'pu1', 'pú': 'pu2', 'pǔ': 'pu3', 'pù': 'pu4', 'qi': 'qi5', 'qī': 'qi1', 'qí': 'qi2', 'qǐ': 'qi3',
'qì': 'qi4', 'qia': 'qia5', 'qiā': 'qia1', 'qiá': 'qia2', 'qiǎ': 'qia3', 'qià': 'qia4', 'qian': 'qian5',
'qiān': 'qian1', 'qián': 'qian2', 'qiǎn': 'qian3', 'qiàn': 'qian4', 'qiang': 'qiang5', 'qiāng': 'qiang1',
'qiáng': 'qiang2', 'qiǎng': 'qiang3', 'qiàng': 'qiang4', 'qiao': 'qiao5', 'qiāo': 'qiao1', 'qiáo': 'qiao2',
'qiǎo': 'qiao3', 'qiào': 'qiao4', 'qie': 'qie5', 'qiē': 'qie1', 'qié': 'qie2', 'qiě': 'qie3', 'qiè': 'qie4',
'qin': 'qin5', 'qīn': 'qin1', 'qín': 'qin2', 'qǐn': 'qin3', 'qìn': 'qin4', 'qing': 'qing5', 'qīng': 'qing1',
'qíng': 'qing2', 'qǐng': 'qing3', 'qìng': 'qing4', 'qiong': 'qiong5', 'qiōng': 'qiong1', 'qióng': 'qiong2',
'qiǒng': 'qiong3', 'qiòng': 'qiong4', 'qiu': 'qiu5', 'qiū': 'qiu1', 'qiú': 'qiu2', 'qiǔ': 'qiu3', 'qiù': 'qiu4',
'qu': 'qu5', 'qū': 'qu1', 'qú': 'qu2', 'qǔ': 'qu3', 'qù': 'qu4', 'quan': 'quan5', 'quān': 'quan1', 'quán': 'quan2',
'quǎn': 'quan3', 'quàn': 'quan4', 'que': 'que5', 'quē': 'que1', 'qué': 'que2', 'quě': 'que3', 'què': 'que4',
'qun': 'qun5', 'qūn': 'qun1', 'qún': 'qun2', 'qǔn': 'qun3', 'qùn': 'qun4', 'ran': 'ran5', 'rān': 'ran1',
'rán': 'ran2', 'rǎn': 'ran3', 'ràn': 'ran4', 'rang': 'rang5', 'rāng': 'rang1', 'ráng': 'rang2', 'rǎng': 'rang3',
'ràng': 'rang4', 'rao': 'rao5', 'rāo': 'rao1', 'ráo': 'rao2', 'rǎo': 'rao3', 'rào': 'rao4', 're': 're5',
'rē': 're1', 'ré': 're2', 'rě': 're3', 'rè': 're4', 'ren': 'ren5', 'rēn': 'ren1', 'rén': 'ren2', 'rěn': 'ren3',
'rèn': 'ren4', 'reng': 'reng5', 'rēng': 'reng1', 'réng': 'reng2', 'rěng': 'reng3', 'rèng': 'reng4', 'ri': 'ri5',
'rī': 'ri1', 'rí': 'ri2', 'rǐ': 'ri3', 'rì': 'ri4', 'rong': 'rong5', 'rōng': 'rong1', 'róng': 'rong2',
'rǒng': 'rong3', 'ròng': 'rong4', 'rou': 'rou5', 'rōu': 'rou1', 'róu': 'rou2', 'rǒu': 'rou3', 'ròu': 'rou4',
'ru': 'ru5', 'rū': 'ru1', 'rú': 'ru2', 'rǔ': 'ru3', 'rù': 'ru4', 'ruan': 'ruan5', 'ruān': 'ruan1', 'ruán': 'ruan2',
'ruǎn': 'ruan3', 'ruàn': 'ruan4', 'rui': 'rui5', 'ruī': 'rui1', 'ruí': 'rui2', 'ruǐ': 'rui3', 'ruì': 'rui4',
'run': 'run5', 'rūn': 'run1', 'rún': 'run2', 'rǔn': 'run3', 'rùn': 'run4', 'ruo': 'ruo5', 'ruō': 'ruo1',
'ruó': 'ruo2', 'ruǒ': 'ruo3', 'ruò': 'ruo4', 'sa': 'sa5', 'sā': 'sa1', 'sá': 'sa2', 'sǎ': 'sa3', 'sà': 'sa4',
'sai': 'sai5', 'sāi': 'sai1', 'sái': 'sai2', 'sǎi': 'sai3', 'sài': 'sai4', 'san': 'san5', 'sān': 'san1',
'sán': 'san2', 'sǎn': 'san3', 'sàn': 'san4', 'sang': 'sang5', 'sāng': 'sang1', 'sáng': 'sang2', 'sǎng': 'sang3',
'sàng': 'sang4', 'sao': 'sao5', 'sāo': 'sao1', 'sáo': 'sao2', 'sǎo': 'sao3', 'sào': 'sao4', 'se': 'se5',
'sē': 'se1', 'sé': 'se2', 'sě': 'se3', 'sè': 'se4', 'sen': 'sen5', 'sēn': 'sen1', 'sén': 'sen2', 'sěn': 'sen3',
'sèn': 'sen4', 'seng': 'seng5', 'sēng': 'seng1', 'séng': 'seng2', 'sěng': 'seng3', 'sèng': 'seng4', 'sha': 'sha5',
'shā': 'sha1', 'shá': 'sha2', 'shǎ': 'sha3', 'shà': 'sha4', 'shai': 'shai5', 'shāi': 'shai1', 'shái': 'shai2',
'shǎi': 'shai3', 'shài': 'shai4', 'shan': 'shan5', 'shān': 'shan1', 'shán': 'shan2', 'shǎn': 'shan3',
'shàn': 'shan4', 'shang': 'shang5', 'shāng': 'shang1', 'sháng': 'shang2', 'shǎng': 'shang3', 'shàng': 'shang4',
'shao': 'shao5', 'shāo': 'shao1', 'sháo': 'shao2', 'shǎo': 'shao3', 'shào': 'shao4', 'she': 'she5', 'shē': 'she1',
'shé': 'she2', 'shě': 'she3', 'shè': 'she4', 'shei': 'shei5', 'shēi': 'shei1', 'shéi': 'shei2', 'shěi': 'shei3',
'shèi': 'shei4', 'shen': 'shen5', 'shēn': 'shen1', 'shén': 'shen2', 'shěn': 'shen3', 'shèn': 'shen4',
'sheng': 'sheng5', 'shēng': 'sheng1', 'shéng': 'sheng2', 'shěng': 'sheng3', 'shèng': 'sheng4', 'shi': 'shi5',
'shī': 'shi1', 'shí': 'shi2', 'shǐ': 'shi3', 'shì': 'shi4', 'shou': 'shou5', 'shōu': 'shou1', 'shóu': 'shou2',
'shǒu': 'shou3', 'shòu': 'shou4', 'shu': 'shu5', 'shū': 'shu1', 'shú': 'shu2', 'shǔ': 'shu3', 'shù': 'shu4',
'shua': 'shua5', 'shuā': 'shua1', 'shuá': 'shua2', 'shuǎ': 'shua3', 'shuà': 'shua4', 'shuai': 'shuai5',
'shuāi': 'shuai1', 'shuái': 'shuai2', 'shuǎi': 'shuai3', 'shuài': 'shuai4', 'shuan': 'shuan5', 'shuān': 'shuan1',
'shuán': 'shuan2', 'shuǎn': 'shuan3', 'shuàn': 'shuan4', 'shuang': 'shuang5', 'shuāng': 'shuang1',
'shuáng': 'shuang2', 'shuǎng': 'shuang3', 'shuàng': 'shuang4', 'shui': 'shui5', 'shuī': 'shui1', 'shuí': 'shui2',
'shuǐ': 'shui3', 'shuì': 'shui4', 'shun': 'shun5', 'shūn': 'shun1', 'shún': 'shun2', 'shǔn': 'shun3',
'shùn': 'shun4', 'shuo': 'shuo5', 'shuō': 'shuo1', 'shuó': 'shuo2', 'shuǒ': 'shuo3', 'shuò': 'shuo4', 'si': 'si5',
'sī': 'si1', 'sí': 'si2', 'sǐ': 'si3', 'sì': 'si4', 'song': 'song5', 'sōng': 'song1', 'sóng': 'song2',
'sǒng': 'song3', 'sòng': 'song4', 'sou': 'sou5', 'sōu': 'sou1', 'sóu': 'sou2', 'sǒu': 'sou3', 'sòu': 'sou4',
'su': 'su5', 'sū': 'su1', 'sú': 'su2', 'sǔ': 'su3', 'sù': 'su4', 'suan': 'suan5', 'suān': 'suan1', 'suán': 'suan2',
'suǎn': 'suan3', 'suàn': 'suan4', 'sui': 'sui5', 'suī': 'sui1', 'suí': 'sui2', 'suǐ': 'sui3', 'suì': 'sui4',
'sun': 'sun5', 'sūn': 'sun1', 'sún': 'sun2', 'sǔn': 'sun3', 'sùn': 'sun4', 'suo': 'suo5', 'suō': 'suo1',
'suó': 'suo2', 'suǒ': 'suo3', 'suò': 'suo4', 'ta': 'ta5', 'tā': 'ta1', 'tá': 'ta2', 'tǎ': 'ta3', 'tà': 'ta4',
'tai': 'tai5', 'tāi': 'tai1', 'tái': 'tai2', 'tǎi': 'tai3', 'tài': 'tai4', 'tan': 'tan5', 'tān': 'tan1',
'tán': 'tan2', 'tǎn': 'tan3', 'tàn': 'tan4', 'tang': 'tang5', 'tāng': 'tang1', 'táng': 'tang2', 'tǎng': 'tang3',
'tàng': 'tang4', 'tao': 'tao5', 'tāo': 'tao1', 'táo': 'tao2', 'tǎo': 'tao3', 'tào': 'tao4', 'te': 'te5',
'tē': 'te1', 'té': 'te2', 'tě': 'te3', 'tè': 'te4', 'teng': 'teng5', 'tēng': 'teng1', 'téng': 'teng2',
'těng': 'teng3', 'tèng': 'teng4', 'ti': 'ti5', 'tī': 'ti1', 'tí': 'ti2', 'tǐ': 'ti3', 'tì': 'ti4', 'tian': 'tian5',
'tiān': 'tian1', 'tián': 'tian2', 'tiǎn': 'tian3', 'tiàn': 'tian4', 'tiao': 'tiao5', 'tiāo': 'tiao1',
'tiáo': 'tiao2', 'tiǎo': 'tiao3', 'tiào': 'tiao4', 'tie': 'tie5', 'tiē': 'tie1', 'tié': 'tie2', 'tiě': 'tie3',
'tiè': 'tie4', 'ting': 'ting5', 'tīng': 'ting1', 'tíng': 'ting2', 'tǐng': 'ting3', 'tìng': 'ting4', 'tong': 'tong5',
'tōng': 'tong1', 'tóng': 'tong2', 'tǒng': 'tong3', 'tòng': 'tong4', 'tou': 'tou5', 'tōu': 'tou1', 'tóu': 'tou2',
'tǒu': 'tou3', 'tòu': 'tou4', 'tu': 'tu5', 'tū': 'tu1', 'tú': 'tu2', 'tǔ': 'tu3', 'tù': 'tu4', 'tuan': 'tuan5',
'tuān': 'tuan1', 'tuán': 'tuan2', 'tuǎn': 'tuan3', 'tuàn': 'tuan4', 'tui': 'tui5', 'tuī': 'tui1', 'tuí': 'tui2',
'tuǐ': 'tui3', 'tuì': 'tui4', 'tun': 'tun5', 'tūn': 'tun1', 'tún': 'tun2', 'tǔn': 'tun3', 'tùn': 'tun4',
'tuo': 'tuo5', 'tuō': 'tuo1', 'tuó': 'tuo2', 'tuǒ': 'tuo3', 'tuò': 'tuo4', 'wa': 'wa5', 'wā': 'wa1', 'wá': 'wa2',
'wǎ': 'wa3', 'wà': 'wa4', 'wai': 'wai5', 'wāi': 'wai1', 'wái': 'wai2', 'wǎi': 'wai3', 'wài': 'wai4', 'wan': 'wan5',
'wān': 'wan1', 'wán': 'wan2', 'wǎn': 'wan3', 'wàn': 'wan4', 'wang': 'wang5', 'wāng': 'wang1', 'wáng': 'wang2',
'wǎng': 'wang3', 'wàng': 'wang4', 'wei': 'wei5', 'wēi': 'wei1', 'wéi': 'wei2', 'wěi': 'wei3', 'wèi': 'wei4',
'wen': 'wen5', 'wēn': 'wen1', 'wén': 'wen2', 'wěn': 'wen3', 'wèn': 'wen4', 'weng': 'weng5', 'wēng': 'weng1',
'wéng': 'weng2', 'wěng': 'weng3', 'wèng': 'weng4', 'wo': 'wo5', 'wō': 'wo1', 'wó': 'wo2', 'wǒ': 'wo3', 'wò': 'wo4',
'wu': 'wu5', 'wū': 'wu1', 'wú': 'wu2', 'wǔ': 'wu3', 'wù': 'wu4', 'xi': 'xi5', 'xī': 'xi1', 'xí': 'xi2', 'xǐ': 'xi3',
'xì': 'xi4', 'xia': 'xia5', 'xiā': 'xia1', 'xiá': 'xia2', 'xiǎ': 'xia3', 'xià': 'xia4', 'xian': 'xian5',
'xiān': 'xian1', 'xián': 'xian2', 'xiǎn': 'xian3', 'xiàn': 'xian4', 'xiang': 'xiang5', 'xiāng': 'xiang1',
'xiáng': 'xiang2', 'xiǎng': 'xiang3', 'xiàng': 'xiang4', 'xiao': 'xiao5', 'xiāo': 'xiao1', 'xiáo': 'xiao2',
'xiǎo': 'xiao3', 'xiào': 'xiao4', 'xie': 'xie5', 'xiē': 'xie1', 'xié': 'xie2', 'xiě': 'xie3', 'xiè': 'xie4',
'xin': 'xin5', 'xīn': 'xin1', 'xín': 'xin2', 'xǐn': 'xin3', 'xìn': 'xin4', 'xing': 'xing5', 'xīng': 'xing1',
'xíng': 'xing2', 'xǐng': 'xing3', 'xìng': 'xing4', 'xiong': 'xiong5', 'xiōng': 'xiong1', 'xióng': 'xiong2',
'xiǒng': 'xiong3', 'xiòng': 'xiong4', 'xiu': 'xiu5', 'xiū': 'xiu1', 'xiú': 'xiu2', 'xiǔ': 'xiu3', 'xiù': 'xiu4',
'xu': 'xu5', 'xū': 'xu1', 'xú': 'xu2', 'xǔ': 'xu3', 'xù': 'xu4', 'xuan': 'xuan5', 'xuān': 'xuan1', 'xuán': 'xuan2',
'xuǎn': 'xuan3', 'xuàn': 'xuan4', 'xue': 'xue5', 'xuē': 'xue1', 'xué': 'xue2', 'xuě': 'xue3', 'xuè': 'xue4',
'xun': 'xun5', 'xūn': 'xun1', 'xún': 'xun2', 'xǔn': 'xun3', 'xùn': 'xun4', 'ya': 'ya5', 'yā': 'ya1', 'yá': 'ya2',
'yǎ': 'ya3', 'yà': 'ya4', 'yan': 'yan5', 'yān': 'yan1', 'yán': 'yan2', 'yǎn': 'yan3', 'yàn': 'yan4',
'yang': 'yang5', 'yāng': 'yang1', 'yáng': 'yang2', 'yǎng': 'yang3', 'yàng': 'yang4', 'yao': 'yao5', 'yāo': 'yao1',
'yáo': 'yao2', 'yǎo': 'yao3', 'yào': 'yao4', 'ye': 'ye5', 'yē': 'ye1', 'yé': 'ye2', 'yě': 'ye3', 'yè': 'ye4',
'yi': 'yi5', 'yī': 'yi1', 'yí': 'yi2', 'yǐ': 'yi3', 'yì': 'yi4', 'yin': 'yin5', 'yīn': 'yin1', 'yín': 'yin2',
'yǐn': 'yin3', 'yìn': 'yin4', 'ying': 'ying5', 'yīng': 'ying1', 'yíng': 'ying2', 'yǐng': 'ying3', 'yìng': 'ying4',
'yo': 'yo5', 'yō': 'yo1', 'yó': 'yo2', 'yǒ': 'yo3', 'yò': 'yo4', 'yong': 'yong5', 'yōng': 'yong1', 'yóng': 'yong2',
'yǒng': 'yong3', 'yòng': 'yong4', 'you': 'you5', 'yōu': 'you1', 'yóu': 'you2', 'yǒu': 'you3', 'yòu': 'you4',
'yu': 'yu5', 'yū': 'yu1', 'yú': 'yu2', 'yǔ': 'yu3', 'yù': 'yu4', 'yuan': 'yuan5', 'yuān': 'yuan1', 'yuán': 'yuan2',
'yuǎn': 'yuan3', 'yuàn': 'yuan4', 'yue': 'yue5', 'yuē': 'yue1', 'yué': 'yue2', 'yuě': 'yue3', 'yuè': 'yue4',
'yun': 'yun5', 'yūn': 'yun1', 'yún': 'yun2', 'yǔn': 'yun3', 'yùn': 'yun4', 'za': 'za5', 'zā': 'za1', 'zá': 'za2',
'zǎ': 'za3', 'zà': 'za4', 'zai': 'zai5', 'zāi': 'zai1', 'zái': 'zai2', 'zǎi': 'zai3', 'zài': 'zai4', 'zan': 'zan5',
'zān': 'zan1', 'zán': 'zan2', 'zǎn': 'zan3', 'zàn': 'zan4', 'zang': 'zang5', 'zāng': 'zang1', 'záng': 'zang2',
'zǎng': 'zang3', 'zàng': 'zang4', 'zao': 'zao5', 'zāo': 'zao1', 'záo': 'zao2', 'zǎo': 'zao3', 'zào': 'zao4',
'ze': 'ze5', 'zē': 'ze1', 'zé': 'ze2', 'zě': 'ze3', 'zè': 'ze4', 'zei': 'zei5', 'zēi': 'zei1', 'zéi': 'zei2',
'zěi': 'zei3', 'zèi': 'zei4', 'zen': 'zen5', 'zēn': 'zen1', 'zén': 'zen2', 'zěn': 'zen3', 'zèn': 'zen4',
'zeng': 'zeng5', 'zēng': 'zeng1', 'zéng': 'zeng2', 'zěng': 'zeng3', 'zèng': 'zeng4', 'zha': 'zha5', 'zhā': 'zha1',
'zhá': 'zha2', 'zhǎ': 'zha3', 'zhà': 'zha4', 'zhai': 'zhai5', 'zhāi': 'zhai1', 'zhái': 'zhai2', 'zhǎi': 'zhai3',
'zhài': 'zhai4', 'zhan': 'zhan5', 'zhān': 'zhan1', 'zhán': 'zhan2', 'zhǎn': 'zhan3', 'zhàn': 'zhan4',
'zhang': 'zhang5', 'zhāng': 'zhang1', 'zháng': 'zhang2', 'zhǎng': 'zhang3', 'zhàng': 'zhang4', 'zhao': 'zhao5',
'zhāo': 'zhao1', 'zháo': 'zhao2', 'zhǎo': 'zhao3', 'zhào': 'zhao4', 'zhe': 'zhe5', 'zhē': 'zhe1', 'zhé': 'zhe2',
'zhě': 'zhe3', 'zhè': 'zhe4', 'zhen': 'zhen5', 'zhēn': 'zhen1', 'zhén': 'zhen2', 'zhěn': 'zhen3', 'zhèn': 'zhen4',
'zheng': 'zheng5', 'zhēng': 'zheng1', 'zhéng': 'zheng2', 'zhěng': 'zheng3', 'zhèng': 'zheng4', 'zhi': 'zhi5',
'zhī': 'zhi1', 'zhí': 'zhi2', 'zhǐ': 'zhi3', 'zhì': 'zhi4', 'zhong': 'zhong5', 'zhōng': 'zhong1', 'zhóng': 'zhong2',
'zhǒng': 'zhong3', 'zhòng': 'zhong4', 'zhou': 'zhou5', 'zhōu': 'zhou1', 'zhóu': 'zhou2', 'zhǒu': 'zhou3',
'zhòu': 'zhou4', 'zhu': 'zhu5', 'zhū': 'zhu1', 'zhú': 'zhu2', 'zhǔ': 'zhu3', 'zhù': 'zhu4', 'zhua': 'zhua5',
'zhuā': 'zhua1', 'zhuá': 'zhua2', 'zhuǎ': 'zhua3', 'zhuà': 'zhua4', 'zhuai': 'zhuai5', 'zhuāi': 'zhuai1',
'zhuái': 'zhuai2', 'zhuǎi': 'zhuai3', 'zhuài': 'zhuai4', 'zhuan': 'zhuan5', 'zhuān': 'zhuan1', 'zhuán': 'zhuan2',
'zhuǎn': 'zhuan3', 'zhuàn': 'zhuan4', 'zhuang': 'zhuang5', 'zhuāng': 'zhuang1', 'zhuáng': 'zhuang2',
'zhuǎng': 'zhuang3', 'zhuàng': 'zhuang4', 'zhui': 'zhui5', 'zhuī': 'zhui1', 'zhuí': 'zhui2', 'zhuǐ': 'zhui3',
'zhuì': 'zhui4', 'zhun': 'zhun5', 'zhūn': 'zhun1', 'zhún': 'zhun2', 'zhǔn': 'zhun3', 'zhùn': 'zhun4',
'zhuo': 'zhuo5', 'zhuō': 'zhuo1', 'zhuó': 'zhuo2', 'zhuǒ': 'zhuo3', 'zhuò': 'zhuo4', 'zi': 'zi5', 'zī': 'zi1',
'zí': 'zi2', 'zǐ': 'zi3', 'zì': 'zi4', 'zong': 'zong5', 'zōng': 'zong1', 'zóng': 'zong2', 'zǒng': 'zong3',
'zòng': 'zong4', 'zou': 'zou5', 'zōu': 'zou1', 'zóu': 'zou2', 'zǒu': 'zou3', 'zòu': 'zou4', 'zu': 'zu5',
'zū': 'zu1', 'zú': 'zu2', 'zǔ': 'zu3', 'zù': 'zu4', 'zuan': 'zuan5', 'zuān': 'zuan1', 'zuán': 'zuan2',
'zuǎn': 'zuan3', 'zuàn': 'zuan4', 'zui': 'zui5', 'zuī': 'zui1', 'zuí': 'zui2', 'zuǐ': 'zui3', 'zuì': 'zui4',
'zun': 'zun5', 'zūn': 'zun1', 'zún': 'zun2', 'zǔn': 'zun3', 'zùn': 'zun4', 'zuo': 'zuo5', 'zuō': 'zuo1',
'zuó': 'zuo2', 'zuǒ': 'zuo3', 'zuò': 'zuo4', 'zhei': 'zhei5', 'zhēi': 'zhei1', 'zhéi': 'zhei2', 'zhěi': 'zhei3',
'zhèi': 'zhei4', 'kei': 'kei5', 'kēi': 'kei1', 'kéi': 'kei2', 'kěi': 'kei3', 'kèi': 'kei4', 'tei': 'tei5',
'tēi': 'tei1', 'téi': 'tei2', 'těi': 'tei3', 'tèi': 'tei4', 'len': 'len5', 'lēn': 'len1', 'lén': 'len2',
'lěn': 'len3', 'lèn': 'len4', 'nun': 'nun5', 'nūn': 'nun1', 'nún': 'nun2', 'nǔn': 'nun3', 'nùn': 'nun4',
'nia': 'nia5', 'niā': 'nia1', 'niá': 'nia2', 'niǎ': 'nia3', 'nià': 'nia4', 'rua': 'rua5', 'ruā': 'rua1',
'ruá': 'rua2', 'ruǎ': 'rua3', 'ruà': 'rua4', 'fiao': 'fiao5', 'fiāo': 'fiao1', 'fiáo': 'fiao2', 'fiǎo': 'fiao3',
'fiào': 'fiao4', 'cei': 'cei5', 'cēi': 'cei1', 'céi': 'cei2', 'cěi': 'cei3', 'cèi': 'cei4', 'wong': 'wong5',
'wōng': 'wong1', 'wóng': 'wong2', 'wǒng': 'wong3', 'wòng': 'wong4', 'din': 'din5', 'dīn': 'din1', 'dín': 'din2',
'dǐn': 'din3', 'dìn': 'din4', 'chua': 'chua5', 'chuā': 'chua1', 'chuá': 'chua2', 'chuǎ': 'chua3', 'chuà': 'chua4',
'n': 'n5', 'n1': 'n1', 'ń': 'n2', 'ň': 'n3', 'ǹ': 'n4', 'ng': 'ng5', 'ng1': 'ng1', 'ńg': 'ng2', 'ňg': 'ng3',
'ǹg': 'ng4'}
shengyundiao2guobiao_dict = {v: k for k, v in guobiao2shengyundiao_dict.items()}


def guobiao2shengyundiao(pinyin_list):
    """Convert tone-marked (national standard) pinyin to initial/final/tone-number pinyin."""
    out = []
    for pin in pinyin_list:
        # Syllables missing from the table map to None.
        out.append(guobiao2shengyundiao_dict.get(pin))
    return out


def shengyundiao2guobiao(pinyin_list):
    """Convert initial/final/tone-number pinyin to tone-marked (national standard) pinyin."""
    out = []
    for pin in pinyin_list:
        # Syllables missing from the table map to None.
        out.append(shengyundiao2guobiao_dict.get(pin))
    return out


if __name__ == "__main__":
    logger.info(__file__)
    out = shengyundiao2guobiao('ni2 hao3 a5'.split())
    assert out == ['ní', 'hǎo', 'a']
    out = guobiao2shengyundiao(out)
    assert out == ['ni2', 'hao3', 'a5']
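
The lookup table above pairs every tone-marked syllable with its letter-plus-digit form. As a rough cross-check, most of that mapping can be approximated with Unicode decomposition instead of a table. The sketch below is not part of phkit: `tone_mark_to_number` is a hypothetical helper, and it does not reproduce special table entries such as `'n1'` or `'ng1'`.

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
_TONE_MARKS = {'\u0304': '1', '\u0301': '2', '\u030c': '3', '\u0300': '4'}
_DIAERESIS = '\u0308'


def tone_mark_to_number(syllable):
    """Hypothetical helper: turn tone-marked pinyin into letters plus a tone digit."""
    tone = '5'  # no tone mark means the neutral tone, written as 5 in the table
    base = []
    # NFD splits a letter such as 'ǎ' into 'a' followed by a combining caron.
    for ch in unicodedata.normalize('NFD', syllable):
        if ch in _TONE_MARKS:
            tone = _TONE_MARKS[ch]
        elif ch == _DIAERESIS and base:
            base[-1] = 'v'  # the table writes ü as v (for example 'lve4')
        else:
            base.append(ch)
    return ''.join(base) + tone


print(tone_mark_to_number('hǎo'))  # hao3
print(tone_mark_to_number('lüè'))  # lve4
print(tone_mark_to_number('a'))    # a5
```

A table keeps the conversion reversible with a single dictionary inversion, which is presumably why the module stores it explicitly; the decomposition approach only covers one direction.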

@ -0,0 +1,78 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/16
"""
#### symbol
Phoneme symbols.
Chinese phonemes, simple English phonemes, and simple Chinese phonemes.
"""
_pad = '_'  # padding symbol
_eos = '~'  # end-of-sequence symbol
_chain = '-'  # connector that joins pronunciation units
_oov = '*'  # out-of-vocabulary symbol
# Chinese phoneme table
# Initials (shengmu): 27
_shengmu = [
    'aa', 'b', 'c', 'ch', 'd', 'ee', 'f', 'g', 'h', 'ii', 'j', 'k', 'l', 'm', 'n', 'oo', 'p', 'q', 'r', 's', 'sh',
    't', 'uu', 'vv', 'x', 'z', 'zh'
]
# Finals (yunmu): 41
_yunmu = [
    'a', 'ai', 'an', 'ang', 'ao', 'e', 'ei', 'en', 'eng', 'er', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing',
    'iong', 'iu', 'ix', 'iy', 'iz', 'o', 'ong', 'ou', 'u', 'ua', 'uai', 'uan', 'uang', 'ueng', 'ui', 'un', 'uo', 'v',
    'van', 've', 'vn', 'ng', 'uong'
]
# Tones (shengdiao): 5
_shengdiao = ['1', '2', '3', '4', '5']
# Letters: 26
_alphabet = 'Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz'.split()
# English: 26
_english = 'A B C D E F G H I J K L M N O P Q R S T U V W X Y Z'.split()
# Punctuation (biaodian): 10
_biaodian = '! ? . , ; : " # ( )'.split()
# Note: !=!|?=?|.=.。|,=,,、|;=;|:=:|"="“|#= \t|(=([{{【<《|)=)]}}】>》
# Other: 7
_other = 'w y 0 6 7 8 9'.split()
# Uppercase letters: 26
_upper = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
# Lowercase letters: 26
_lower = list('abcdefghijklmnopqrstuvwxyz')
# Punctuation marks: 12
_punctuation = list('!\'"(),-.:;? ')
# Digits: 10
_digit = list('0123456789')
# Letters and symbols: 64
# For English: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s
_character_en = _upper + _lower + _punctuation
# Letters, digits, and symbols: 74
# For English or Chinese: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s0123456789
_character_cn = _upper + _lower + _punctuation + _digit
# Chinese phonemes: 145
# Supports Chinese, English, and mixed Chinese-English text; Chinese characters are converted to
# the Tsinghua University standard phoneme representation
symbol_chinese = [_pad, _eos, _chain] + _shengmu + _yunmu + _shengdiao + _alphabet + _english + _biaodian + _other
# Simple English phonemes: 66
# Supports English text
# ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s
symbol_english_simple = [_pad, _eos] + _upper + _lower + _punctuation
# Simple Chinese phonemes: 76
# Supports English and Chinese text; Chinese characters are converted to pinyin strings
# ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'"(),-.:;?\s0123456789
symbol_chinese_simple = [_pad, _eos] + _upper + _lower + _punctuation + _digit
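
Symbol lists like the ones above are typically turned into ID lookup tables before being fed to a model. The following is a minimal sketch under assumptions: the `symbols` list is a tiny illustrative subset rather than the real 145-entry `symbol_chinese`, the `encode` and `decode` helpers are hypothetical, and appending the end symbol `~` is an assumption about how a caller might terminate a sequence.

```python
# Illustrative subset only; the real symbol_chinese list has 145 entries.
symbols = ['_', '~', '-', 'n', 'h', 'i', 'ao', '2', '3']

# Position in the list doubles as the numeric ID.
symbol_to_id = {s: i for i, s in enumerate(symbols)}
id_to_symbol = {i: s for i, s in enumerate(symbols)}


def encode(phonemes, eos='~'):
    """Map phoneme strings to IDs, skip unknown symbols, then append the end symbol."""
    ids = [symbol_to_id[p] for p in phonemes if p in symbol_to_id]
    ids.append(symbol_to_id[eos])
    return ids


def decode(ids):
    """Map IDs back to phoneme strings."""
    return [id_to_symbol[i] for i in ids]


ids = encode(['n', 'i', '2', '-', 'h', 'ao', '3'])
print(ids)          # [3, 5, 7, 2, 4, 6, 8, 1]
print(decode(ids))  # ['n', 'i', '2', '-', 'h', 'ao', '3', '~']
```

Because the ID is just the list index, reordering or inserting symbols changes every ID after that point, so a trained model and its symbol list need to stay in sync.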

@ -0,0 +1,19 @@
Copyright (c) 2017 Keith Ito
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

@ -0,0 +1,116 @@
"""
### english
From https://github.com/keithito/tacotron:
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
"""
import re
import random
from . import cleaners
from .symbols import symbols
# Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
# Regular expression matching text enclosed in curly braces:
_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)')


def get_arpabet(word, dictionary):
    # Use the first listed pronunciation, wrapped in curly braces; fall back to the word itself.
    word_arpabet = dictionary.lookup(word)
    if word_arpabet is not None:
        return "{" + word_arpabet[0] + "}"
    else:
        return word


def text_to_sequence(text, cleaner_names, dictionary=None, p_arpabet=1.0):
    '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.

    The text can optionally have ARPAbet sequences enclosed in curly braces embedded
    in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street."

    Args:
      text: string to convert to a sequence
      cleaner_names: names of the cleaner functions to run the text through
      dictionary: arpabet class with arpabet dictionary
      p_arpabet: probability of replacing a word with its ARPAbet pronunciation

    Returns:
      List of integers corresponding to the symbols in the text
    '''
    sequence = []
    space = _symbols_to_sequence(' ')
    # Check for curly braces and treat their contents as ARPAbet:
    while len(text):
        m = _curly_re.match(text)
        if not m:
            clean_text = _clean_text(text, cleaner_names)
            if dictionary is not None:
                clean_text = [get_arpabet(w, dictionary)
                              if random.random() < p_arpabet else w
                              for w in clean_text.split(" ")]
                for i in range(len(clean_text)):
                    t = clean_text[i]
                    if t.startswith("{"):
                        sequence += _arpabet_to_sequence(t[1:-1])
                    else:
                        sequence += _symbols_to_sequence(t)
                    sequence += space
            else:
                sequence += _symbols_to_sequence(clean_text)
            break
        sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))
        sequence += _arpabet_to_sequence(m.group(2))
        text = m.group(3)
    # Remove the trailing space; the guards avoid an IndexError on empty input.
    if sequence and space and sequence[-1] == space[0]:
        sequence = sequence[:-1]
    return sequence


def sequence_to_text(sequence):
    '''Converts a sequence of IDs back to a string'''
    result = []
    for symbol_id in sequence:
        if symbol_id in _id_to_symbol:
            s = _id_to_symbol[symbol_id]
            # Enclose ARPAbet back in curly braces:
            if len(s) > 1 and s[0] == '@':
                s = '{%s}' % s[1:]
            result.append(s)
    result = ''.join(result)
    return result.replace('}{', ' ')


def _clean_text(text, cleaner_names):
    for name in cleaner_names:
        # Default to None so that an unknown name reaches the explicit error below.
        cleaner = getattr(cleaners, name, None)
        if not cleaner:
            raise Exception('Unknown cleaner: %s' % name)
        text = cleaner(text)
    return text


def _symbols_to_sequence(symbols):
    return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]


def _arpabet_to_sequence(text):
    return _symbols_to_sequence(['@' + s for s in text.split()])


def _should_keep_symbol(s):
    # Compare by value; identity checks against string literals are unreliable.
    return s in _symbol_to_id and s != '_' and s != '~'
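
The curly-brace handling in `text_to_sequence` is easier to follow in isolation. The sketch below reuses the same regular expression to split a string into plain-text and ARPAbet segments; `split_arpabet` is a hypothetical helper written for illustration and is not part of this module.

```python
import re

# Same pattern as _curly_re: lazy prefix, braces content, remainder.
_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)')


def split_arpabet(text):
    """Hypothetical helper: split text into ('text', str) and ('arpabet', [phones]) segments."""
    segments = []
    while text:
        m = _curly_re.match(text)
        if not m:
            # No more braces: the rest is plain text.
            segments.append(('text', text))
            break
        if m.group(1):
            segments.append(('text', m.group(1)))
        segments.append(('arpabet', m.group(2).split()))
        text = m.group(3)
    return segments


print(split_arpabet('Turn left on {HH AW1 S T AH0 N} Street.'))
# [('text', 'Turn left on '), ('arpabet', ['HH', 'AW1', 'S', 'T', 'AH0', 'N']), ('text', ' Street.')]
```

In the real function the plain-text segments go through the cleaners and the ARPAbet tokens are prefixed with `@` before the ID lookup, so the two kinds of symbol cannot collide.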

@ -0,0 +1,91 @@
'''
### english
From https://github.com/keithito/tacotron:
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
'''
import re
from unidecode import unidecode
from .numbers import normalize_numbers
# Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+')
# List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
    ('mrs', 'misess'),
    ('mr', 'mister'),
    ('dr', 'doctor'),
    ('st', 'saint'),
    ('co', 'company'),
    ('jr', 'junior'),
    ('maj', 'major'),
    ('gen', 'general'),
    ('drs', 'doctors'),
    ('rev', 'reverend'),
    ('lt', 'lieutenant'),
    ('hon', 'honorable'),
    ('sgt', 'sergeant'),
    ('capt', 'captain'),
    ('esq', 'esquire'),
    ('ltd', 'limited'),
    ('col', 'colonel'),
    ('ft', 'fort'),
]]


def expand_abbreviations(text):
    for regex, replacement in _abbreviations:
        text = re.sub(regex, replacement, text)
    return text


def expand_numbers(text):
    return normalize_numbers(text)


def lowercase(text):
    return text.lower()


def collapse_whitespace(text):
    return re.sub(_whitespace_re, ' ', text)


def convert_to_ascii(text):
    return unidecode(text)


def basic_cleaners(text):
    '''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text


def transliteration_cleaners(text):
    '''Pipeline for non-English text that transliterates to ASCII.'''
    text = convert_to_ascii(text)
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text


def english_cleaners(text):
    '''Pipeline for English text, including number and abbreviation expansion.'''
    text = convert_to_ascii(text)
    text = lowercase(text)
    text = expand_numbers(text)
    text = expand_abbreviations(text)
    text = collapse_whitespace(text)
    return text
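
To see what the cleaner pipeline does to a sentence, the sketch below reimplements three of its steps (lowercasing, abbreviation expansion, whitespace collapsing) with only the standard library. It is a simplified stand-in, not the module itself: the abbreviation list is cut down to three entries, and `simple_clean` is a hypothetical name.

```python
import re

_whitespace_re = re.compile(r'\s+')

# A reduced abbreviation table, built the same way as _abbreviations.
_abbreviations = [(re.compile(r'\b%s\.' % abbr, re.IGNORECASE), full) for abbr, full in [
    ('mr', 'mister'),
    ('dr', 'doctor'),
    ('st', 'saint'),
]]


def simple_clean(text):
    """Hypothetical pipeline: lowercase, expand abbreviations, collapse whitespace."""
    text = text.lower()
    for regex, replacement in _abbreviations:
        text = regex.sub(replacement, text)
    return _whitespace_re.sub(' ', text)


print(simple_clean('Dr.  Smith met Mr. Jones on Main St.'))
# doctor smith met mister jones on main saint
```

The last word shows a limitation of purely pattern-based expansion: `st.` always becomes `saint`, even where `street` was meant.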

File diff suppressed because it is too large.

@ -0,0 +1,65 @@
""" from https://github.com/keithito/tacotron """
import re
valid_symbols = [
    'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2',
    'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2',
    'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY',
    'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1',
    'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0',
    'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW',
    'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH'
]
_valid_symbol_set = set(valid_symbols)


class CMUDict:
    '''Thin wrapper around CMUDict data. http://www.speech.cs.cmu.edu/cgi-bin/cmudict'''

    def __init__(self, file_or_path, keep_ambiguous=True):
        if isinstance(file_or_path, str):
            with open(file_or_path, encoding='latin-1') as f:
                entries = _parse_cmudict(f)
        else:
            entries = _parse_cmudict(file_or_path)
        if not keep_ambiguous:
            # Keep only words that have exactly one pronunciation.
            entries = {word: pron for word, pron in entries.items() if len(pron) == 1}
        self._entries = entries

    def __len__(self):
        return len(self._entries)

    def lookup(self, word):
        '''Returns list of ARPAbet pronunciations of the given word.'''
        return self._entries.get(word.upper())


# Matches the "(1)", "(2)", ... suffix that marks alternative pronunciations.
_alt_re = re.compile(r'\([0-9]+\)')


def _parse_cmudict(file):
    cmudict = {}
    for line in file:
        if len(line) and (line[0] >= 'A' and line[0] <= 'Z' or line[0] == "'"):
            # CMUdict separates the word from its pronunciation with two spaces.
            parts = line.split('  ')
            word = re.sub(_alt_re, '', parts[0])
            pronunciation = _get_pronunciation(parts[1])
            if pronunciation:
                if word in cmudict:
                    cmudict[word].append(pronunciation)
                else:
                    cmudict[word] = [pronunciation]
    return cmudict


def _get_pronunciation(s):
    parts = s.strip().split(' ')
    for part in parts:
        if part not in _valid_symbol_set:
            return None
    return ' '.join(parts)
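
The parsing rules above (skip comment lines, split the word from its pronunciation, merge numbered alternatives) can be tried on a few in-memory lines. This is a minimal sketch, assuming the CMUdict layout of a word, two spaces, then space-separated phones; `parse_entries` is a hypothetical simplification that skips the phone validation done by `_get_pronunciation`.

```python
import io
import re

# Matches the "(1)", "(2)", ... suffix on alternative pronunciations.
_alt_re = re.compile(r'\([0-9]+\)')


def parse_entries(lines):
    """Hypothetical simplified parser: word -> list of pronunciation strings."""
    entries = {}
    for line in lines:
        # Entry lines start with an uppercase letter or an apostrophe; others are comments.
        if line and ('A' <= line[0] <= 'Z' or line[0] == "'"):
            word, sep, pron = line.strip().partition('  ')
            if not sep:
                continue  # not in "WORD  PHONES" form
            word = _alt_re.sub('', word)
            entries.setdefault(word, []).append(pron)
    return entries


sample = io.StringIO(
    ";;; a comment line\n"
    "READ  R EH1 D\n"
    "READ(1)  R IY1 D\n"
)
print(parse_entries(sample))
# {'READ': ['R EH1 D', 'R IY1 D']}
```

With `keep_ambiguous=False`, a word like this one with two pronunciations would be dropped from the dictionary altogether.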

@ -0,0 +1,71 @@
""" from https://github.com/keithito/tacotron """
import inflect
import re

_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
_number_re = re.compile(r'[0-9]+')


def _remove_commas(m):
    return m.group(1).replace(',', '')


def _expand_decimal_point(m):
    return m.group(1).replace('.', ' point ')


def _expand_dollars(m):
    match = m.group(1)
    parts = match.split('.')
    if len(parts) > 2:
        return match + ' dollars'  # Unexpected format
    dollars = int(parts[0]) if parts[0] else 0
    cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
    if dollars and cents:
        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
        cent_unit = 'cent' if cents == 1 else 'cents'
        return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
    elif dollars:
        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
        return '%s %s' % (dollars, dollar_unit)
    elif cents:
        cent_unit = 'cent' if cents == 1 else 'cents'
        return '%s %s' % (cents, cent_unit)
    else:
        return 'zero dollars'


def _expand_ordinal(m):
    return _inflect.number_to_words(m.group(0))


def _expand_number(m):
    num = int(m.group(0))
    if num > 1000 and num < 3000:
        if num == 2000:
            return 'two thousand'
        elif num > 2000 and num < 2010:
            return 'two thousand ' + _inflect.number_to_words(num % 100)
        elif num % 100 == 0:
            return _inflect.number_to_words(num // 100) + ' hundred'
        else:
            return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
    else:
        return _inflect.number_to_words(num, andword='')


def normalize_numbers(text):
    text = re.sub(_comma_number_re, _remove_commas, text)
    text = re.sub(_pounds_re, r'\1 pounds', text)
    text = re.sub(_dollars_re, _expand_dollars, text)
    text = re.sub(_decimal_number_re, _expand_decimal_point, text)
    text = re.sub(_ordinal_re, _expand_ordinal, text)
    text = re.sub(_number_re, _expand_number, text)
    return text
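The order of the substitutions in `normalize_numbers` matters: commas are removed first so that the currency and decimal patterns see plain digits. The sketch below illustrates only that ordering and is not the module's code. It leaves digits as digits instead of spelling them out, so it needs no `inflect`; `normalize_digits_only` is a hypothetical name.

```python
import re

_comma_number_re = re.compile(r'([0-9][0-9,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_dollars_re = re.compile(r'\$([0-9.,]*[0-9]+)')


def _expand_dollars(m):
    parts = m.group(1).split('.')
    if len(parts) > 2:
        return m.group(1) + ' dollars'  # unexpected format such as "1.2.3"
    dollars = int(parts[0]) if parts[0] else 0
    cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
    pieces = []
    if dollars:
        pieces.append('%d %s' % (dollars, 'dollar' if dollars == 1 else 'dollars'))
    if cents:
        pieces.append('%d %s' % (cents, 'cent' if cents == 1 else 'cents'))
    return ', '.join(pieces) or 'zero dollars'


def normalize_digits_only(text):
    """Apply the comma, dollar, and decimal steps in the same order as above."""
    text = _comma_number_re.sub(lambda m: m.group(1).replace(',', ''), text)
    text = _dollars_re.sub(_expand_dollars, text)
    text = _decimal_number_re.sub(lambda m: m.group(1).replace('.', ' point '), text)
    return text


print(normalize_digits_only('It costs $1,234.50'))  # It costs 1234 dollars, 50 cents
print(normalize_digits_only('pi is 3.14'))          # pi is 3 point 14
```

Running the decimal step before the dollar step would turn `$1234.50` into `$1234 point 50`, and the cents would no longer be recognised.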

@ -0,0 +1,21 @@
""" from https://github.com/keithito/tacotron """
'''
Defines the set of symbols used in text input to the model.

The default is a set of ASCII characters that works well for English or text that has been run
through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details.
'''
from . import cmudict
_punctuation = '!\'",.:;? '
_math = '#%&*+-/[]()'
_special = '_@©°½—₩€$'
_accented = 'áçéêëñöøćž'
_numbers = '0123456789'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as
# uppercase letters):
_arpabet = ['@' + s for s in cmudict.valid_symbols]
# Export all symbols:
symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet
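A symbol list like this is normally turned into lookup tables that map text to integer IDs and back. The sketch below assumes that usage; `_arpabet_subset`, `symbol_to_id`, and `id_to_symbol` are illustrative names, and only three ARPAbet symbols are included instead of the full `cmudict.valid_symbols`.

```python
_punctuation = '!\'",.:;? '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_arpabet_subset = ['AH0', 'S', 'T']  # hypothetical sample, not the full symbol set

# Same construction as above: characters first, then "@"-prefixed ARPAbet symbols.
symbols = list(_punctuation + _letters) + ['@' + s for s in _arpabet_subset]
symbol_to_id = {s: i for i, s in enumerate(symbols)}
id_to_symbol = {i: s for i, s in enumerate(symbols)}

# The "@" prefix keeps the phoneme "S" distinct from the uppercase letter "S".
assert symbol_to_id['S'] != symbol_to_id['@S']

ids = [symbol_to_id[ch] for ch in 'Hi!']
print(ids)                                     # [16, 43, 0]
print(''.join(id_to_symbol[i] for i in ids))   # Hi!
```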

@ -0,0 +1,294 @@
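Chinese character to pinyin converter (Python version)
======================================================
|Build| |appveyor| |Coverage| |PyPI version| |DOI|
Converts Chinese characters to pinyin. Useful for phonetic annotation, sorting, and search of Chinese text (`Russian translation`_).
Based on `hotoo/pinyin <https://github.com/hotoo/pinyin>`__.
* Documentation: http://pypinyin.rtfd.io/
* GitHub: https://github.com/mozillazg/python-pinyin
* License: MIT license
* PyPI: https://pypi.org/project/pypinyin
* Python version: 2.7, pypy, pypy3, 3.4, 3.5, 3.6, 3.7, 3.8
.. contents::
Features
--------
* Picks the most appropriate pinyin by matching against known phrases.
* Supports heteronyms (characters with more than one reading).
* Basic support for Traditional Chinese and for Bopomofo (Zhuyin).
* Supports a variety of pinyin and Bopomofo output styles.
Installation
------------
.. code-block:: bash
$ pip install pypinyin
Usage
-----
Python 3 (on Python 2, replace ``'中心'`` with ``u'中心'``):
.. code-block:: python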
>>> from pypinyin import pinyin, lazy_pinyin, Style
>>> pinyin('中心')
[['zhōng'], ['xīn']]
>>> pinyin('中心', heteronym=True) # enable heteronym mode
[['zhōng', 'zhòng'], ['xīn']]
>>> pinyin('中心', style=Style.FIRST_LETTER) # set the pinyin style
[['z'], ['x']]
>>> pinyin('中心', style=Style.TONE2, heteronym=True)
[['zho1ng', 'zho4ng'], ['xi1n']]
>>> pinyin('中心', style=Style.TONE3, heteronym=True)
[['zhong1', 'zhong4'], ['xin1']]
>>> pinyin('中心', style=Style.BOPOMOFO) # Bopomofo (Zhuyin) style
[['ㄓㄨㄥ'], ['ㄒㄧㄣ']]
>>> lazy_pinyin('中心') # ignore heteronyms
['zhong', 'xin']
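**Notes**
* The output does not mark which finals are neutral tone: a neutral-tone final carries no tone mark or digit (to mark the neutral tone with ``5``, see the `documentation <https://pypinyin.readthedocs.io/zh_CN/master/contrib.html#neutraltonewith5mixin>`__).
* In styles without tones, ``ü`` is written as ``v`` (to output ``ü`` instead of ``v``, see the `documentation <https://pypinyin.readthedocs.io/zh_CN/master/contrib.html#v2umixin>`__).
Command-line tool: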
.. code-block:: console
$ pypinyin 音乐
yīn yuè
$ pypinyin -h
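Documentation
-------------
Full documentation: http://pypinyin.rtfd.io/
For questions about developing the project itself, see the `developer documentation <https://pypinyin.readthedocs.io/zh_CN/develop/develop.html>`__.
FAQ
---
Wrong pinyin for a heteronym inside a word?
+++++++++++++++++++++++++++++++++++++++++++
Heteronyms are currently resolved through a phrase pinyin dictionary. If a reading is wrong,
you can register a custom phrase entry to correct it:
.. code-block:: python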
>>> from pypinyin import Style, pinyin, load_phrases_dict
>>> pinyin('步履蹒跚')
[['bù'], ['lǚ'], ['mán'], ['shān']]
>>> load_phrases_dict({'步履蹒跚': [['bù'], ['lǚ'], ['pán'], ['shān']]})
>>> pinyin('步履蹒跚')
[['bù'], ['lǚ'], ['pán'], ['shān']]
See the `documentation <https://pypinyin.readthedocs.io/zh_CN/master/usage.html#custom-dict>`__ for details.
Why are y, w, and yu not returned as initials?
++++++++++++++++++++++++++++++++++++++++++++++
.. code-block:: python
>>> from pypinyin import Style, pinyin
>>> pinyin('下雨天', style=Style.INITIALS)
[['x'], [''], ['t']]
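Because according to the `Scheme for the Chinese Phonetic Alphabet (《汉语拼音方案》) <http://www.moe.edu.cn/s78/A19/yxs_left/moe_810/s230/195802/t19580201_186000.html>`__,
y, w, and ü (yu) are not initials.
In the INITIALS style, characters such as “雨”, “我”, and “圆” return an empty string, because according to the
`Scheme for the Chinese Phonetic Alphabet (《汉语拼音方案》) <http://www.moe.edu.cn/s78/A19/yxs_left/moe_810/s230/195802/t19580201_186000.html>`__
y, w, and ü (yu) are not initials: y or w is added only when certain finals have no initial, and ü follows its own rules. -- @hotoo
**If this is inconvenient for you, also watch out for characters that have no initial at all (such as “啊”, “饿”, “按”, “昂”).
In that case the FIRST_LETTER style may be what you actually need.** -- @hotoo
References: `hotoo/pinyin#57 <https://github.com/hotoo/pinyin/issues/57>`__,
`#22 <https://github.com/mozillazg/python-pinyin/pull/22>`__,
`#27 <https://github.com/mozillazg/python-pinyin/issues/27>`__,
`#44 <https://github.com/mozillazg/python-pinyin/issues/44>`__
If this behavior is not what you want and you do want y treated as an initial, pass ``strict=False``,
which may match your expectations:
.. code-block:: python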
>>> from pypinyin import Style, pinyin
>>> pinyin('下雨天', style=Style.INITIALS)
[['x'], [''], ['t']]
>>> pinyin('下雨天', style=Style.INITIALS, strict=False)
[['x'], ['y'], ['t']]
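See `the effect of the strict parameter <https://pypinyin.readthedocs.io/zh_CN/master/usage.html#strict>`__ for details.
How to reduce memory usage
++++++++++++++++++++++++++
If pinyin accuracy is not critical, you can save memory by setting the environment variables ``PYPINYIN_NO_PHRASES``
and ``PYPINYIN_NO_DICT_COPY``.
See the `documentation <https://pypinyin.readthedocs.io/zh_CN/master/faq.html#no-phrases>`__ for details.
For more questions, see the
`FAQ <https://pypinyin.readthedocs.io/zh_CN/master/faq.html>`__ section of the documentation.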
.. _#13 : https://github.com/mozillazg/python-pinyin/issues/113
.. _strict 参数的影响: https://pypinyin.readthedocs.io/zh_CN/master/usage.html#strict
Pinyin data
-----------
* Single-character pinyin comes from `pinyin-data`_
* Phrase pinyin comes from `phrase-pinyin-data`_
Related Projects
-----------------
* `hotoo/pinyin`__: the Node.js/JavaScript version of the converter.
* `mozillazg/go-pinyin`__: the Go version of the converter.
* `mozillazg/rust-pinyin`__: the Rust version of the converter.
__ https://github.com/hotoo/pinyin
__ https://github.com/mozillazg/go-pinyin
__ https://github.com/mozillazg/rust-pinyin
.. |Build| image:: https://img.shields.io/circleci/project/github/mozillazg/python-pinyin/master.svg
:target: https://circleci.com/gh/mozillazg/python-pinyin
.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/ni8gdyextfa85yqo/branch/master?svg=true
:target: https://ci.appveyor.com/project/mozillazg/python-pinyin
.. |Coverage| image:: https://img.shields.io/codecov/c/github/mozillazg/python-pinyin/master.svg
:target: https://codecov.io/gh/mozillazg/python-pinyin
.. |PyPI version| image:: https://img.shields.io/pypi/v/pypinyin.svg
:target: https://pypi.org/project/pypinyin/
.. |DOI| image:: https://zenodo.org/badge/12830126.svg
:target: https://zenodo.org/badge/latestdoi/12830126
.. _Russian translation: https://github.com/mozillazg/python-pinyin/blob/master/README_ru.rst
.. _pinyin-data: https://github.com/mozillazg/pinyin-data
.. _phrase-pinyin-data: https://github.com/mozillazg/phrase-pinyin-data
.. _开发文档: https://pypinyin.readthedocs.io/zh_CN/develop/develop.html
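# pinyin-data [![Build Status](https://travis-ci.org/mozillazg/pinyin-data.svg?branch=master)](https://travis-ci.org/mozillazg/pinyin-data)
Pinyin data for Chinese characters.
## Data format
Each line of pinyin data has the form:

    {code point}: {pinyins} # {hanzi} {comments}

* Lines starting with `#` are comments; anything after a `#` within a line is also a comment
* Multiple readings in `{pinyins}` are separated by commas
* Example:

      # comment
      U+4E2D: zhōng,zhòng # 中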
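
The line format described above is simple enough to parse with one regular expression. The following is a sketch based only on that description, not the loader used by any of the packages here; `parse_pinyin_line` is a hypothetical helper.

```python
import re

# Assumed layout, taken from the format description:
#   U+4E2D: zhōng,zhòng # 中
_line_re = re.compile(r"^U\+([0-9A-Fa-f]+):\s*([^#]+?)\s*(?:#.*)?$")


def parse_pinyin_line(line):
    """Return (character, [readings]) for a data line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _line_re.match(line)
    if match is None:
        return None
    char = chr(int(match.group(1), 16))           # code point -> character
    readings = [p.strip() for p in match.group(2).split(",")]
    return char, readings


print(parse_pinyin_line("U+4E2D: zhōng,zhòng # 中"))  # ('中', ['zhōng', 'zhòng'])
print(parse_pinyin_line("# comment"))                 # None
```

The character is derived from the code point rather than from the trailing comment, so the result does not depend on the comment being present.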
[Unihan Database][unihan] data version:
> Date: 2018-11-09 21:36:19 GMT [JHJ]
> Unicode version: 12.0.0
* `kHanyuPinyin.txt`: pinyin data from the [kHanyuPinyin](http://www.unicode.org/reports/tr38/#kHanyuPinyin) field of the [Unihan Database][unihan] (sourced from 《漢語大字典》)
* `kXHC1983.txt`: pinyin data from the [kXHC1983](http://www.unicode.org/reports/tr38/#kXHC1983) field of the [Unihan Database][unihan] (sourced from 《现代汉语词典》)
* `kHanyuPinlu.txt`: pinyin data from the [kHanyuPinlu](http://www.unicode.org/reports/tr38/#kHanyuPinlu) field of the [Unihan Database][unihan] (sourced from 《現代漢語頻率詞典》)
* `kMandarin.txt`: pinyin data from the [kMandarin](http://www.unicode.org/reports/tr38/#kMandarin) field of the [Unihan Database][unihan] (the single most common Mandarin reading; zh-CN takes precedence, and the zh-TW reading is used when zh-CN has none)
* `kMandarin_overwrite.txt`: manual corrections for wrong readings in `kMandarin.txt` (**editable**)
* `GBK_PUA.txt`: characters in the [Private Use Area](https://en.wikipedia.org/wiki/Private_Use_Areas) that have pinyin; see [GB 18030 - Wikipedia](https://zh.wikipedia.org/wiki/GB_18030#PUA) (**editable**)
* `nonCJKUI.txt`: characters that are not [CJK Unified Ideographs](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) but still have pinyin (**editable**)
* `kanji.txt`: pinyin data for [kanji coined in Japan](https://zh.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E6%B1%89%E5%AD%97#7_%E6%97%A5%E6%9C%AC%E6%B1%89%E5%AD%97%E7%9A%84%E6%B1%89%E8%AF%AD%E6%99%AE%E9%80%9A%E8%AF%9D%E8%A7%84%E8%8C%83%E8%AF%BB%E9%9F%B3%E8%A1%A8) (**editable**)
* `kMandarin_8105.txt`: the single most common reading of each of the 8105 characters in the [《通用规范汉字表》](https://zh.wikipedia.org/wiki/通用规范汉字表) (2013 edition) (**editable**)
* `overwrite.txt`: manually corrected pinyin data (**editable**)
* `pinyin.txt`: the pinyin data produced by merging the files above
* `zdic.txt`: pinyin data from [zdic.net (汉典网)](http://zdic.net) (**editable**)
## References
* [Scheme for the Chinese Phonetic Alphabet (汉语拼音方案)](http://www.moe.edu.cn/s78/A19/yxs_left/moe_810/s230/195802/t19580201_186000.html)
* [Unihan Database Lookup](http://www.unicode.org/charts/unihan.html)
* [汉典 zdic.net](http://www.zdic.net/)
* [字海网 / 叶典网](http://zisea.com/)
* [国学大师 (guoxuedashi.com)](http://www.guoxuedashi.com/)
* [Chinese characters in Unicode, GB2312, GBK, and GB18030](http://www.fmddlmyy.cn/text24.html)
* [GB 18030 - Wikipedia, the free encyclopedia](https://zh.wikipedia.org/wiki/GB_18030#PUA)
* [通用规范汉字表 - Wikipedia, the free encyclopedia](https://zh.wikipedia.org/wiki/%E9%80%9A%E7%94%A8%E8%A7%84%E8%8C%83%E6%B1%89%E5%AD%97%E8%A1%A8)
* [China’s 通用规范汉字表 (Tōngyòng Guīfàn Hànzìbiǎo)](https://blogs.adobe.com/CCJKType/2014/03/china-8105.html)
* [Standard Chinese readings of Japanese kanji](http://www.moe.gov.cn/s78/A19/yxs_left/moe_810/s230/201001/t20100115_75698.html)
* [Table of standard Mandarin readings of Japanese kanji - Wikipedia](https://zh.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E6%B1%89%E5%AD%97#7_%E6%97%A5%E6%9C%AC%E6%B1%89%E5%AD%97%E7%9A%84%E6%B1%89%E8%AF%AD%E6%99%AE%E9%80%9A%E8%AF%9D%E8%A7%84%E8%8C%83%E8%AF%BB%E9%9F%B3%E8%A1%A8)
[unihan]: http://www.unicode.org/charts/unihan.html
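# phrase-pinyin-data [![Build Status](https://travis-ci.org/mozillazg/phrase-pinyin-data.svg?branch=master)](https://travis-ci.org/mozillazg/phrase-pinyin-data)
Pinyin data for words and phrases.
## Data format
Each line of pinyin data has the form:
```
{phrase}: {pinyin}
```
* Lines starting with `#` are comments
* A `#` at the end of a line also starts a comment
* `{phrase}` is the Chinese word or phrase
* `{pinyin}` is its pinyin, with one space-separated syllable per character
* Each line holds one reading; a phrase with several readings appears on several lines
* Example:
```
# comment
中国: zhōng guó
北京: běi jīng # comment
```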
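
A line in this format can be split with plain string methods. The sketch below follows the rules listed above and rejects lines whose syllable count does not match the number of characters; `parse_phrase_line` is a hypothetical helper, not part of the data repository.

```python
def parse_phrase_line(line):
    """Return (phrase, [syllables]) or None for comments, blanks, and malformed lines."""
    body = line.split("#", 1)[0].strip()   # drop full-line and trailing comments
    if not body:
        return None
    phrase, sep, pinyin = body.partition(":")
    phrase = phrase.strip()
    syllables = pinyin.split()
    # Expect a colon and exactly one syllable per character.
    if not sep or len(phrase) != len(syllables):
        return None
    return phrase, syllables


print(parse_phrase_line("北京: běi jīng # comment"))  # ('北京', ['běi', 'jīng'])
print(parse_phrase_line("# comment"))                 # None
```

Checking for the colon before reading the pinyin part means a blank or malformed line returns `None` instead of raising.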
Files:
* `overwrite.txt`: manually corrected pinyin data
* `pinyin.txt`: the pinyin data after merging `pinyin.txt + overwrite.txt`
* `zdic_cibs.txt`: word dictionary pinyin data from [zdic.net (汉典网)](http://www.zdic.net/)
* `zdic_cybs.txt`: idiom dictionary pinyin data from [zdic.net (汉典网)](http://www.zdic.net/)
* `cc_cedict.txt`: pinyin data from [cc-cedict.org](https://cc-cedict.org/)
* `large_pinyin.txt`: the pinyin data after merging `zdic_cibs.txt + zdic_cybs.txt + cc_cedict.txt + pinyin.txt + overwrite.txt`
## References
* The initial data is based on [phrases-dict.js](https://github.com/hotoo/pinyin/blob/05f74496c34ccb32db1a0fd0b358a798a22a51e5/data/phrases-dict.js) and [phrases_dict.py](https://github.com/mozillazg/python-pinyin/blob/366de0363ff1fb9a718ce668448bea59de09a4bf/pypinyin/phrases_dict.py)
* [汉典 zdic.net](http://www.zdic.net/)
* [字海网 / 叶典网](http://zisea.com/)
* [国学大师 (guoxuedashi.com)](http://www.guoxuedashi.com/)
* [CC-CEDICT download - MDBG English to Chinese dictionary](http://www.mdbg.net/chindict/chindict.php?page=cc-cedict)

@ -0,0 +1,51 @@
"""
### pinyinkit
Text-to-pinyin module. Depends on python-pinyin, jieba, and phrase-pinyin-data.
"""
import re

from .core import lazy_pinyin, pinyin, slug, Style, initialize
from pypinyin.style import convert

# Compatible with versions before 0.1.0.
# Tone 5 is the neutral tone.
_diao_re = re.compile(r"([12345]$)")


def text2pinyin(text, errors=None, **kwargs):
    """
    Convert Chinese text to a list of pinyin.
    :param text: str, the Chinese text.
    :param errors: function, handler for characters that cannot be converted to pinyin;
        by default they are kept and split into single characters.
    :return: list, the pinyin list.
    """
    if errors is None:
        errors = default_errors
    pin = lazy_pinyin(text, style=Style.TONE3, errors=errors, strict=True, neutral_tone_with_five=True, **kwargs)
    return pin


def default_errors(x):
    return list(x)


def split_pinyin(py):
    """
    Split a single pinyin syllable into its toneless part and its tone.
    :param py: str, the pinyin string, e.g. ``pin1``.
    :return: list, ``[toneless syllable, tone]``; the tone defaults to ``"5"`` when no tone digit is present.
    """
    parts = _diao_re.split(py)
    if len(parts) == 1:
        fuyuan = py
        diao = "5"
    else:
        fuyuan = parts[0]
        diao = parts[1]
    return [fuyuan, diao]


if __name__ == "__main__":
    print(__file__)
    assert text2pinyin("拼音") == ['pin1', 'yin1']
    assert text2pinyin("汉字,a1") == ['han4', 'zi4', ',', 'a', '1']

@ -0,0 +1,457 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/5/30
"""
Based on python-pinyin (pypinyin), phrase-pinyin-data, pinyin-data and jieba.
"""
from __future__ import unicode_literals
from itertools import chain
from pypinyin.compat import text_type
from pypinyin.constants import (
PHRASES_DICT, PINYIN_DICT, Style
)
from pypinyin.converter import DefaultConverter, _mixConverter
from pypinyin.seg import mmseg
from pypinyin.seg.simpleseg import seg
from pypinyin.utils import _replace_tone2_style_dict_to_default
from tqdm import tqdm
import jieba
import re
from pathlib import Path
_ziyin_re = re.compile(r"^U\+(\w+?):(.+?)#(.+)$")
_true_pin_re = re.compile(r"[^a-zA-Z]+")
is_initialized = False
def load_single_dict(pinyin_dict, style='default'):
    """Load a user-defined single-character pinyin dictionary.

    :param pinyin_dict: single-character pinyin dictionary, e.g. ``{0x963F: u"ā,ē"}``
    :param style: pinyin style used by the values of pinyin_dict. Supports 'default' and 'tone2'.
    :type pinyin_dict: dict
    """
    if style == 'tone2':
        for k, v in pinyin_dict.items():
            v = _replace_tone2_style_dict_to_default(v)
            PINYIN_DICT[k] = v
    else:
        PINYIN_DICT.update(pinyin_dict)
    mmseg.retrain(mmseg.seg)


def load_phrases_dict(phrases_dict, style='default'):
    """Load a user-defined phrase pinyin dictionary.

    :param phrases_dict: phrase pinyin dictionary, e.g. ``{u"阿爸": [[u"ā"], [u"bà"]]}``
    :param style: pinyin style used by the values of phrases_dict. Supports 'default' and 'tone2'.
    :type phrases_dict: dict
    """
    if style == 'tone2':
        for k, value in phrases_dict.items():
            v = [
                list(map(_replace_tone2_style_dict_to_default, pys))
                for pys in value
            ]
            PHRASES_DICT[k] = v
    else:
        PHRASES_DICT.update(phrases_dict)
    mmseg.retrain(mmseg.seg)
def parse_pinyin_txt(inpath):
    # Example line: U+4E2D: zhōng,zhòng # 中
    outs = []
    with open(inpath, encoding="utf8") as fin:
        for line in tqdm(fin, desc='load pinyin', ncols=80, mininterval=1):
            if line.startswith("#"):
                continue
            res = _ziyin_re.search(line)
            if res:
                zi = res.group(3).strip()
                if len(zi) == 1:
                    outs.append([zi, res.group(2).strip().split(",")])
                else:
                    print(line)
            elif line.strip():
                print(line)
    return {ord(z): ','.join(p) for z, p in outs}


def parse_phrase_txt(inpath):
    # Example line: 一一对应: yī yī duì yìng
    outs = []
    with open(inpath, encoding="utf8") as fin:
        for line in tqdm(fin, desc='load phrase', ncols=80, mininterval=1):
            if line.startswith("#"):
                continue
            parts = line.split(":")
            # Check the field count first so blank or malformed lines cannot raise IndexError.
            if len(parts) != 2:
                if line.strip():
                    print(line)
                continue
            zs = parts[0].strip()
            ps = parts[1].strip().split()
            if len(zs) == len(ps) and len(zs) >= 2:
                outs.append([zs, ps])
            elif line.strip():
                print(line)
    return {zs: [[p] for p in ps] for zs, ps in outs}
def initialize():
    # Load the bundled data.
    inpath = Path(__file__).absolute().parent.joinpath('phrase_pinyin.txt.py')
    _phrases_dict = parse_phrase_txt(inpath)
    load_phrases_dict(_phrases_dict)  # big:398815 small:36776
    inpath = Path(__file__).absolute().parent.joinpath('single_pinyin.txt.py')
    _pinyin_dict = parse_pinyin_txt(inpath)
    load_single_dict(_pinyin_dict)  # 41451
    jieba.initialize()
    # for word, _ in tqdm(_phrases_dict.items(), desc='jieba add word', ncols=80, mininterval=1):
    #     jieba.add_word(word)
    global is_initialized
    is_initialized = True
class Pinyin(object):
    def __init__(self, converter=None, **kwargs):
        self._converter = converter or DefaultConverter()

    def pinyin(self, hans, style=Style.TONE, heteronym=False,
               errors='default', strict=True, **kwargs):
        """Convert Chinese characters to pinyin and return the list of readings.

        :param hans: a string of Chinese characters ( ``'你好吗'`` ) or a list ( ``['你好', '吗']`` ).
                     You can segment the string with a word segmenter of your choice
                     and pass in the resulting list of strings.
        :type hans: unicode string or list of strings
        :param style: pinyin style; defaults to :py:attr:`~pypinyin.Style.TONE`.
                      See :class:`~pypinyin.Style` for more styles.
        :param errors: how to handle characters that have no pinyin; see :ref:`handle_no_pinyin`.

                       * ``'default'``: keep the original character
                       * ``'ignore'``: drop the character
                       * ``'replace'``: replace it with its unicode escape without the ``\\u`` prefix
                         (``'\\u90aa'`` => ``'90aa'``)
                       * callable: a callback function or other callable object
        :param heteronym: whether to return multiple readings for heteronyms
        :param strict: for styles that return only initials or only finals, whether to follow
                       the Scheme for the Chinese Phonetic Alphabet strictly when handling
                       initials and finals; see :ref:`strict`
        :return: list of pinyin
        :rtype: list
        """
        # Segment the string into words.
        if isinstance(hans, text_type):
            han_list = self.seg(hans)
        else:
            han_list = chain(*(self.seg(x) for x in hans))
        pys = []
        for words in han_list:
            pys.extend(
                self._converter.convert(
                    words, style, heteronym, errors, strict=strict))
        return pys

    def lazy_pinyin(self, hans, style=Style.NORMAL,
                    errors='default', strict=True, **kwargs):
        """Convert Chinese characters to pinyin and return a list without heteronym alternatives.

        Unlike :py:func:`~pypinyin.pinyin`, the pinyin of each character is a string,
        and each character has exactly one reading.

        :param hans: Chinese characters
        :type hans: unicode or list
        :param style: pinyin style; defaults to :py:attr:`~pypinyin.Style.NORMAL`.
                      See :class:`~pypinyin.Style` for more styles.
        :param errors: how to handle characters that have no pinyin; see
                       :py:func:`~pypinyin.pinyin`
        :param strict: for styles that return only initials or only finals, whether to follow
                       the Scheme for the Chinese Phonetic Alphabet strictly when handling
                       initials and finals; see :ref:`strict`
        :return: list of pinyin (e.g. ``['zhong', 'guo', 'ren']``)
        :rtype: list
        """
        return list(
            chain(
                *self.pinyin(
                    hans, style=style, heteronym=False,
                    errors=errors, strict=strict)))

    def pre_seg(self, hans, **kwargs):
        """Called by ``seg`` to preprocess the unsegmented string before segmentation.

        If the return value is a ``list``, it is treated as a finished segmentation and
        ``seg`` does not call ``seg_function``.
        This implementation always returns a jieba segmentation.

        :param hans: the string before segmentation
        :return: ``None`` or ``list``
        """
        outs = list(jieba.cut(hans))  # Segment with jieba by default, i.e. by meaning.
        return outs

    def seg(self, hans, **kwargs):
        """Segment Chinese text into words.

        ``pre_seg`` is called before segmentation and ``post_seg`` after it.

        :param hans: the string to segment
        :return: the segmented list
        """
        pre_data = self.pre_seg(hans)
        if isinstance(pre_data, list):
            seg_data = pre_data
        else:
            seg_data = self.get_seg()(hans)
        post_data = self.post_seg(hans, seg_data)
        if isinstance(post_data, list):
            return post_data
        return seg_data

    def get_seg(self, **kwargs):
        """Return the segmentation function.

        :return: the segmentation function
        """
        return seg

    def post_seg(self, hans, seg_data, **kwargs):
        """Called by ``seg`` to process the result after segmentation.

        By default nothing is returned and ``seg_data`` is used unchanged.
        If the return value is a ``list``, it is treated as a reprocessed segmentation
        and ``seg`` returns it instead.

        :param hans: the string before segmentation
        :param seg_data: the segmentation result
        :type seg_data: list
        :return: ``None`` or ``list``
        """
        pass
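The `pre_seg` / `seg` / `post_seg` chain is a small hook pattern: a hook that returns a list overrides the default, and any other return value falls through. The sketch below isolates that control flow under simplified assumptions. `Segmenter` and `WhitespaceSegmenter` are hypothetical classes, and the fallback segmenter is a plain per-character split rather than pypinyin's `seg` or jieba.

```python
class Segmenter:
    """Minimal sketch of the pre_seg / seg / post_seg hook chain."""

    def pre_seg(self, text):
        # Return a list to supply a finished segmentation, or None to fall through.
        return None

    def get_seg(self):
        # Fallback segmenter: one token per character (stand-in for a real segmenter).
        return list

    def post_seg(self, text, seg_data):
        # Return a list to replace the segmentation, or None to keep it.
        return None

    def seg(self, text):
        pre_data = self.pre_seg(text)
        seg_data = pre_data if isinstance(pre_data, list) else self.get_seg()(text)
        post_data = self.post_seg(text, seg_data)
        return post_data if isinstance(post_data, list) else seg_data


class WhitespaceSegmenter(Segmenter):
    def pre_seg(self, text):
        return text.split()  # already segmented, so the fallback is skipped

    def post_seg(self, text, seg_data):
        return [w.lower() for w in seg_data]


print(Segmenter().seg("abc"))                        # ['a', 'b', 'c']
print(WhitespaceSegmenter().seg("Text To Speech"))   # ['text', 'to', 'speech']
```

Because the class above always returns a list from `pre_seg`, its `get_seg` fallback is never reached; a subclass that wants the fallback back would override `pre_seg` to return `None`.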
_default_convert = DefaultConverter()
_default_pinyin = Pinyin(_default_convert)
def to_fixed(pinyin, style, strict=True):
    # Kept for backward compatibility. TODO: deprecate.
    return _default_convert.convert_style(
        '', pinyin, style=style, strict=strict, default=pinyin)


_to_fixed = to_fixed


def handle_nopinyin(chars, errors='default', heteronym=True):
    # Kept for backward compatibility. TODO: deprecate.
    return _default_convert.handle_nopinyin(
        chars, style=None, errors=errors, heteronym=heteronym, strict=True)


def single_pinyin(han, style, heteronym, errors='default', strict=True):
    # Kept for backward compatibility. TODO: deprecate.
    return _default_convert._single_pinyin(
        han, style, heteronym, errors=errors, strict=strict)


def phrase_pinyin(phrase, style, heteronym, errors='default', strict=True):
    # Kept for backward compatibility. TODO: deprecate.
    return _default_convert._phrase_pinyin(
        phrase, style, heteronym, errors=errors, strict=strict)
def pinyin(hans, style=Style.TONE, heteronym=False,
           errors='default', strict=True,
           v_to_u=False, neutral_tone_with_five=False):
    """Convert Chinese characters to pinyin and return the list of readings.

    :param hans: a string of Chinese characters ( ``'你好吗'`` ) or a list ( ``['你好', '吗']`` ).
                 You can segment the string with a word segmenter of your choice
                 and pass in the resulting list of strings.
    :type hans: unicode string or list of strings
    :param style: pinyin style; defaults to :py:attr:`~pypinyin.Style.TONE`.
                  See :class:`~pypinyin.Style` for more styles.
    :param errors: how to handle characters that have no pinyin; see :ref:`handle_no_pinyin`.

                   * ``'default'``: keep the original character
                   * ``'ignore'``: drop the character
                   * ``'replace'``: replace it with its unicode escape without the ``\\u`` prefix
                     (``'\\u90aa'`` => ``'90aa'``)
                   * callable: a callback function or other callable object
    :param heteronym: whether to return multiple readings for heteronyms
    :param strict: for styles that return only initials or only finals, whether to follow
                   the Scheme for the Chinese Phonetic Alphabet strictly when handling
                   initials and finals; see :ref:`strict`
    :param v_to_u: in styles without tones, whether to output ``ü`` instead of ``v``
    :type v_to_u: bool
    :param neutral_tone_with_five: in styles that write tones as digits, whether to
                                   mark the neutral tone with 5
    :type neutral_tone_with_five: bool
    :return: list of pinyin
    :rtype: list
    :raise AssertionError: raised when the input string is not unicode

    Usage::

      >>> from pypinyin import pinyin, Style
      >>> import pypinyin
      >>> pinyin('中心')
      [['zhōng'], ['xīn']]
      >>> pinyin('中心', heteronym=True)  # enable heteronym mode
      [['zhōng', 'zhòng'], ['xīn']]
      >>> pinyin('中心', style=Style.FIRST_LETTER)  # set the pinyin style
      [['z'], ['x']]
      >>> pinyin('中心', style=Style.TONE2)
      [['zho1ng'], ['xi1n']]
      >>> pinyin('中心', style=Style.CYRILLIC)
      [['чжун1'], ['синь1']]
      >>> pinyin('战略', v_to_u=True, style=Style.NORMAL)
      [['zhan'], ['lüe']]
      >>> pinyin('衣裳', style=Style.TONE3, neutral_tone_with_five=True)
      [['yi1'], ['shang5']]
    """
    global is_initialized
    if not is_initialized:
        initialize()
        is_initialized = True
    _pinyin = Pinyin(_mixConverter(
        v_to_u=v_to_u, neutral_tone_with_five=neutral_tone_with_five))
    return _pinyin.pinyin(
        hans, style=style, heteronym=heteronym, errors=errors, strict=strict)
def slug(hans, style=Style.NORMAL, heteronym=False, separator='-',
         errors='default', strict=True):
    """Convert Chinese characters to pinyin and join them into a slug string.

    :param hans: Chinese characters
    :type hans: unicode or list
    :param style: pinyin style; defaults to :py:attr:`~pypinyin.Style.NORMAL`.
                  See :class:`~pypinyin.Style` for more styles.
    :param heteronym: whether to return multiple readings for heteronyms
    :param separator: separator placed between two pinyin syllables
    :param errors: how to handle characters that have no pinyin; see
                   :py:func:`~pypinyin.pinyin`
    :param strict: for styles that return only initials or only finals, whether to follow
                   the Scheme for the Chinese Phonetic Alphabet strictly when handling
                   initials and finals; see :ref:`strict`
    :return: the slug string.
    :raise AssertionError: raised when the input string is not unicode

    ::

      >>> import pypinyin
      >>> from pypinyin import Style
      >>> pypinyin.slug('中国人')
      'zhong-guo-ren'
      >>> pypinyin.slug('中国人', separator=' ')
      'zhong guo ren'
      >>> pypinyin.slug('中国人', style=Style.FIRST_LETTER)
      'z-g-r'
      >>> pypinyin.slug('中国人', style=Style.CYRILLIC)
      'чжун1-го2-жэнь2'
    """
    global is_initialized
    if not is_initialized:
        initialize()
        is_initialized = True
    return separator.join(
        chain(
            *_default_pinyin.pinyin(
                hans, style=style, heteronym=heteronym,
                errors=errors, strict=strict
            )
        )
    )
def lazy_pinyin(hans, style=Style.NORMAL, errors='default', strict=True,
                v_to_u=False, neutral_tone_with_five=False):
    """Convert Chinese characters to pinyin and return a list without heteronym alternatives.

    Unlike :py:func:`~pypinyin.pinyin`, each returned pinyin is a string,
    and each character has exactly one reading.

    :param hans: Chinese characters
    :type hans: unicode or list
    :param style: pinyin style; defaults to :py:attr:`~pypinyin.Style.NORMAL`.
                  See :class:`~pypinyin.Style` for more styles.
    :param errors: how to handle characters that have no pinyin; see
                   :py:func:`~pypinyin.pinyin`
    :param strict: for styles that return only initials or only finals, whether to follow
                   the Scheme for the Chinese Phonetic Alphabet strictly when handling
                   initials and finals; see :ref:`strict`
    :param v_to_u: in styles without tones, whether to output ``ü`` instead of ``v``
    :type v_to_u: bool
    :param neutral_tone_with_five: in styles that write tones as digits, whether to
                                   mark the neutral tone with 5
    :type neutral_tone_with_five: bool
    :return: list of pinyin (e.g. ``['zhong', 'guo', 'ren']``)
    :rtype: list
    :raise AssertionError: raised when the input string is not unicode

    Usage::

      >>> from pypinyin import lazy_pinyin, Style
      >>> import pypinyin
      >>> lazy_pinyin('中心')
      ['zhong', 'xin']
      >>> lazy_pinyin('中心', style=Style.TONE)
      ['zhōng', 'xīn']
      >>> lazy_pinyin('中心', style=Style.FIRST_LETTER)
      ['z', 'x']
      >>> lazy_pinyin('中心', style=Style.TONE2)
      ['zho1ng', 'xi1n']
      >>> lazy_pinyin('中心', style=Style.CYRILLIC)
      ['чжун1', 'синь1']
      >>> lazy_pinyin('战略', v_to_u=True)
      ['zhan', 'lüe']
      >>> lazy_pinyin('衣裳', style=Style.TONE3, neutral_tone_with_five=True)
      ['yi1', 'shang5']
    """
    global is_initialized
    if not is_initialized:
        initialize()
        is_initialized = True
    _pinyin = Pinyin(_mixConverter(
        v_to_u=v_to_u, neutral_tone_with_five=neutral_tone_with_five))
    return _pinyin.lazy_pinyin(
        hans, style=style, errors=errors, strict=strict)
if __name__ == "__main__":
print(__file__)
han = '老师很重视这个问题啊,请重说一遍。。。很难说有山难发生,理发师和会计谁会发财?'
out = _default_pinyin.seg(han)
assert out == ['老师', '', '重视', '这个', '问题', '', '', '请重', '', '一遍', '', '', '', '很难说', '有山难', '发生', '',
'理发师', '', '会计', '', '', '发财', '']
out = lazy_pinyin(han, style=8, neutral_tone_with_five=True)
assert out == ['lao3', 'shi1', 'hen3', 'zhong4', 'shi4', 'zhe4', 'ge4', 'wen4', 'ti2', 'a5', '', 'qing3', 'zhong4',
'shuo1', 'yi1', 'bian4', '', '', '', 'hen3', 'nan2', 'shuo1', 'you3', 'shan1', 'nan2', 'fa1',
'sheng1', '', 'li3', 'fa4', 'shi1', 'he2', 'kuai4', 'ji4', 'shui2', 'hui4', 'fa1', 'cai2', '']
out = slug(han, style=8, separator=' ')
assert out == 'lao3 shi1 hen3 zhong4 shi4 zhe4 ge4 wen4 ti2 a qing3 zhong4 shuo1 yi1 bian4 。 。 。 hen3 nan2 shuo1 you3 shan1 nan2 fa1 sheng1 li3 fa4 shi1 he2 kuai4 ji4 shui2 hui4 fa1 cai2 '

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

@ -0,0 +1,6 @@
pypinyin
hanziconv
jieba
inflect
unidecode
tqdm

@ -0,0 +1,44 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2019/12/1
"""
local
"""
import logging

logging.basicConfig(level=logging.INFO)


def run_text2phoneme():
    from phkit.chinese.sequence import text2phoneme, text2sequence
    text = "汉字转音素TTS《Text to speech》。"
    # text = "岂有此理"
    # text = "我的儿子玩会儿"
    out = text2phoneme(text)
    print(out)
    # ['h', 'an', '4', '-', 'z', 'iy', '4', '-', 'zh', 'uan', '3', '-', 'ii', 'in', '1', '-', 's', 'u', '4', '-', ',',
    # 'Tt', 'Tt', 'Ss', ':', '(', 'T', 'E', 'X', 'T', '#', 'T', 'O', '#', 'S', 'P', 'E', 'E', 'C', 'H', ')', '.', '-',
    # '~', '_']
    out = text2sequence(text)
    print(out)
    # [11, 32, 76, 2, 28, 51, 76, 2, 29, 59, 75, 2, 12, 46, 73, 2, 22, 56, 76, 2, 133, 97, 97, 96, 135, 138, 123, 108,
    # 127, 123, 137, 123, 118, 137, 122, 119, 108, 108, 106, 111, 139, 132, 2, 1, 0]


def run_english():
    from phkit.english import text_to_sequence, sequence_to_text
    from phkit.english.cmudict import CMUDict
    text = "text to speech"
    cmupath = 'phkit/english/cmu_dictionary'
    cmudict = CMUDict(cmupath)
    seq = text_to_sequence(text, cleaner_names=["english_cleaners"], dictionary=cmudict)
    print(seq)
    txt = sequence_to_text(seq)
    print(txt)


if __name__ == "__main__":
    print(__file__)
    run_text2phoneme()
    run_english()

@ -0,0 +1,86 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2019/12/15
"""
Speech processing toolkit.
Build a wheel package: python setup.py bdist_wheel
Upload directly to PyPI: python setup.py sdist upload
Upload to PyPI with twine:
Build the source package: python setup.py sdist
Upload the package: twine upload dist/phkit-0.0.3.tar.gz
Note: this requires a .pypirc configuration file in your home directory, in the following format:
[distutils]
index-servers=pypi
[pypi]
repository = https://upload.pypi.org/legacy/
username: admin
password: admin
"""
from setuptools import setup, find_packages
import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(os.path.splitext(os.path.basename(__name__))[0])

install_requires = ['pypinyin>=0.41.0', 'hanziconv', 'jieba>=0.42.1', 'tqdm', 'inflect', 'unidecode']
requires = install_requires


def create_readme():
    from phkit import readme_docs
    docs = []
    with open("README.md", "wt", encoding="utf8") as fout:
        for doc in readme_docs:
            fout.write(doc)
            docs.append(doc)
    return "".join(docs)


def pip_install():
    # `requires` is the same list as `install_requires`, so iterate over it once.
    for pkg in install_requires:
        # Quote the requirement so the shell does not treat ">=" as a redirection.
        # os.system reports failure through its return code rather than by raising.
        if os.system('pip install "{}"'.format(pkg)) != 0:
            logger.info("pip install {} failed".format(pkg))


pip_install()
phkit_doc = create_readme()
from phkit import __version__ as phkit_version

setup(
    name="phkit",
    version=phkit_version,
    author="kuangdd",
    author_email="kuangdd@foxmail.com",
    description="phoneme toolkit",
    long_description=phkit_doc,
    long_description_content_type="text/markdown",
    url="https://github.com/KuangDD/phkit",
    packages=find_packages(exclude=['contrib', 'docs', 'tests*']),
    install_requires=install_requires,  # minimal dependencies needed to run the project
    python_requires='>=3.5',  # supported Python versions
    package_data={
        'txt': ['requirements.txt'],
        'md': ['**/*.md', '*.md'],
    },  # package data, usually data closely tied to the package implementation
    classifiers=[
        'Intended Audience :: Developers',
        'Topic :: Software Development :: Build Tools',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        "Operating System :: OS Independent",
    ],
)

if __name__ == "__main__":
    print(__file__)

@ -0,0 +1,51 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: kuangdd
# date: 2020/2/18
"""
"""


def test_phkit():
    from phkit import text2phoneme, text2sequence, symbol_chinese
    from phkit import chinese_sequence_to_text, chinese_text_to_sequence
    text = "汉字转音素TTS《Text to speech》。"
    target_ph = ['h', 'an', '4', '-', 'z', 'iy', '4', '-', 'zh', 'uan', '3', '-', 'ii', 'in', '1', '-', 's', 'u', '4',
                 '-', ',', '-',
                 'Tt', 'Tt', 'Ss', '-', ':', '-', '(', '-', 'T', 'E', 'X', 'T', '-', '#', '-', 'T', 'O', '-', '#', '-',
                 'S', 'P', 'E', 'E', 'C', 'H', '-', ')', '-', '.', '-', '~', '_']
    result = text2phoneme(text)
    assert result == target_ph
    target_seq = [11, 32, 74, 2, 28, 51, 74, 2, 29, 59, 73, 2, 12, 46, 71, 2, 22, 56, 74, 2, 131, 2, 95, 95, 94, 2, 133,
                  2, 136, 2, 121,
                  106, 125, 121, 2, 135, 2, 121, 116, 2, 135, 2, 120, 117, 106, 106, 104, 109, 2, 137, 2, 130, 2, 1, 0]
    result = text2sequence(text)
    assert result == target_seq
    result = chinese_text_to_sequence(text)
    assert result == target_seq
    target_ph = ' '.join(target_ph)
    result = chinese_sequence_to_text(result)
    assert result == target_ph
    assert len(symbol_chinese) == 145
    text = "岂有此理"
    target = ['q', 'i', '2', '-', 'ii', 'iu', '3', '-', 'c', 'iy', '2', '-', 'l', 'i', '3', '-', '~', '_']
    result = text2phoneme(text)
    assert result == target
    text = "我的儿子玩会儿"
    target = ['uu', 'uo', '3', '-', 'd', 'e', '5', '-', 'ee', 'er', '2', '-', 'z', 'iy', '5', '-', 'uu', 'uan', '2',
              '-', 'h', 'ui', '4', '-', 'ee', 'er', '5', '-', '~', '_']
    result = text2phoneme(text)
    assert result == target


if __name__ == "__main__":
    print(__file__)
    test_phkit()