# Chinese Text Normalization for Speech Processing

## Problem

Search for "Text Normalization" (TN) on Google or GitHub and you can hardly find open-source projects that are "ready-to-use" for text normalization tasks. Instead, you find a bunch of NLP toolkits and frameworks that *support* TN functionality. There is quite a lot of work between "supporting text normalization" and "doing text normalization".

## Reason

* TN is language-dependent, more or less.

    Some TN processing methods are shared across languages, but a good TN module always involves language-specific knowledge and treatments.

* TN is task-specific.

    Even for the same language, different applications require quite different TN.

* TN is "dirty"

    Constructing and maintaining a set of TN rewrite rules is painful, whatever toolkits and frameworks you choose. The subtle, intrinsic complexities hide inside the TN task itself, not in the tools or frameworks.

* A mature TN module is an asset

    Since constructing and maintaining TN is hard, it is actually an asset for commercial companies, hence you are unlikely to find a product-level TN in the open-source community (correct me if you find any).

* TN is a less important topic for both academia and industry.

## Goal

This project sets up a ready-to-use TN module for **Chinese**. Since my background is **speech processing**, this project should be able to handle most common TN tasks in **Chinese ASR** text processing pipelines.

## Normalizers

1. supported NSW (Non-Standard-Word) Normalization

    |NSW type|raw|normalized|
    |-|-|-|
    |cardinal|这块黄金重达324.75克|这块黄金重达三百二十四点七五克|
    |date|她出生于86年8月18日,她弟弟出生于1995年3月1日|她出生于八六年八月十八日 她弟弟出生于一九九五年三月一日|
    |digit|电影中梁朝伟扮演的陈永仁的编号27149|电影中梁朝伟扮演的陈永仁的编号二七一四九|
    |fraction|现场有7/12的观众投出了赞成票|现场有十二分之七的观众投出了赞成票|
    |money|随便来几个价格12块5,34.5元,20.1万|随便来几个价格十二块五 三十四点五元 二十点一万|
    |percentage|明天有62%的概率降雨|明天有百分之六十二的概率降雨|
    |telephone|这是固话0421-33441122<br>这是手机+86 18544139121|这是固话零四二一三三四四一一二二<br>这是手机八六一八五四四一三九一二一|

    acknowledgement: the NSW normalization code is based on [Zhiyang Zhou's work here](https://github.com/Joee1995/chn_text_norm.git)

2. punctuation removal

    For Chinese, it removes the punctuation list collected in the [Zhon](https://github.com/tsroten/zhon) project, containing

    * non-stop puncs
        ```
        '"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
        ```
    * stop puncs
        ```
        '!?。。'
        ```

    For English, it removes Python's `string.punctuation`.

3. multilingual English word upper/lower case conversion

    Since ASR/TTS lexicons usually unify English entries to uppercase or lowercase, the TN module should adapt to the lexicon accordingly.

## Supported text format

1. plain text, preferably one sentence per line (the most common case in ASR processing).

    ```
    今天早饭吃了没
    没吃回家吃去吧
    ...
    ```

    Plain text is the default format.

2. Kaldi's transcription format

    ```
    KALDI_KEY_UTT001 今天早饭吃了没
    KALDI_KEY_UTT002 没吃回家吃去吧
    ...
    ```

    TN skips the key in the first column and normalizes only the transcription text that follows. Pass the `--has_key` option to switch to Kaldi format.

_note: All input text should be UTF-8 encoded._

## Run examples

* TN (python)

    Make sure you have **python3**; python2.X won't work correctly.

    Run `sh run.sh` in the `TN` dir, and compare the raw text with the normalized text.

* ITN (thrax)

    Make sure you have **thrax** installed, and that your PATH can find the thrax binaries.

    Run `sh run.sh` in the `ITN` dir. Check the Makefile for grammar dependencies.

## Possible future work

Since TN is a typical "done is better than perfect" module in the context of ASR, and the current state is sufficient for my purposes, I probably won't update this repo frequently.

There are indeed some things that need to be improved:

* For TN, the NSW normalizers in the `TN` dir are based on regular expressions. I've found some unintended matches, so those regex patterns need to be refined for more precise TN coverage.
* For ITN, extend the thrax rewrite grammars to cover more scenarios.
* Furthermore, commercial systems have started to introduce RNN-like models into TN, and a hybrid of rule-based and model-based components is the state of the art. For further reading, look for the work of Richard Sproat and Kyle Gorman at Google.
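
To make the regex-based approach mentioned above concrete, here is a minimal Python sketch of two NSW types from the Normalizers table (digit and percentage), together with Kaldi-style key skipping. It is not the code in the `TN` dir: the function names (`cardinal`, `normalize_percentage`, `normalize_digits`, `normalize_text`, `normalize_line`) and the `has_key` parameter (named after the `--has_key` option) are made up for illustration, `cardinal` only covers 0 to 9999, and reading every leftover digit one by one is a deliberate simplification.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def cardinal(n: int) -> str:
    """Read an integer in 0..9999 as a Chinese cardinal (illustrative helper)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n == 0:
        return DIGITS[0]
    s = str(n)
    out = []
    pending_zero = False
    for i, ch in enumerate(s):
        d = int(ch)
        if d == 0:
            pending_zero = True  # emit a single 零 only if a non-zero digit follows
            continue
        if pending_zero:
            out.append(DIGITS[0])
            pending_zero = False
        out.append(DIGITS[d] + UNITS[len(s) - 1 - i])
    text = "".join(out)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    return text[1:] if 10 <= n <= 19 else text


# Integer percentages of up to four digits; decimals such as 62.5% are left untouched.
PERCENT_RE = re.compile(r"(?<![0-9.])([0-9]{1,4})%")


def normalize_percentage(text: str) -> str:
    return PERCENT_RE.sub(lambda m: "百分之" + cardinal(int(m.group(1))), text)


def normalize_digits(text: str) -> str:
    """Read every ASCII digit one by one, as for IDs and serial numbers."""
    return re.sub(r"[0-9]", lambda m: DIGITS[int(m.group())], text)


def normalize_text(text: str) -> str:
    # Assumption: percentages first, then all leftover digits are read one by one.
    return normalize_digits(normalize_percentage(text))


def normalize_line(line: str, has_key: bool = False) -> str:
    """Normalize one line; with has_key=True the first column is kept verbatim."""
    line = line.rstrip("\n")
    if not has_key:
        return normalize_text(line)
    parts = line.split(maxsplit=1)
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    key, text = parts
    return f"{key} {normalize_text(text)}"


if __name__ == "__main__":
    print(normalize_line("明天有62%的概率降雨"))
    print(normalize_line("KALDI_KEY_UTT001 电影中梁朝伟扮演的陈永仁的编号27149", has_key=True))
```

Without `has_key=True`, the digits inside a key such as `KALDI_KEY_UTT001` would be rewritten as well, which is why the key column has to be split off before any rule runs.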