You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
PaddleSpeech/third_party/chinese_text_normalization/README.md

113 lines
4.9 KiB

# Chinese Text Normalization for Speech Processing
## Problem
Search for "Text Normalization"(TN) on Google and Github, you can hardly find open-source projects that are "read-to-use" for text normalization tasks. Instead, you find a bunch of NLP toolkits or frameworks that *supports* TN functionality. There is quite some work between "support text normalization" and "do text normalization".
## Reason
* TN is language-dependent, more or less.
Some of TN processing methods are shared across languages, but a good TN module always involves language-specific knowledge and treatments, more or less.
* TN is task-specific.
Even for the same language, different applications require quite different TN.
* TN is "dirty"
Constructing and maintaining a set of TN rewrite-rules is painful, whatever toolkits and frameworks you choose. Subtle and intrinsic complexities hide inside TN task itself, not in tools or frameworks.
* mature TN module is an asset
Since constructing and maintaining TN is hard, it is actually an asset for commercial companies, hence it is unlikely to find a product-level TN in open-source community (correct me if you find any)
* TN is a less important topic for either academic or commercials.
## Goal
This project sets up a ready-to-use TN module for **Chinese**. Since my background is **speech processing**, this project should be able to handle most common TN tasks, in **Chinese ASR** text processing pipelines.
## Normalizers
1. supported NSW (Non-Standard-Word) Normalization
|NSW type|raw|normalized|
|-|-|-|
|cardinal|这块黄金重达324.75克|这块黄金重达三百二十四点七五克|
|date|她出生于86年8月18日她弟弟出生于1995年3月1日|她出生于八六年八月十八日 她弟弟出生于一九九五年三月一日|
|digit|电影中梁朝伟扮演的陈永仁的编号27149|电影中梁朝伟扮演的陈永仁的编号二七一四九|
|fraction|现场有7/12的观众投出了赞成票|现场有十二分之七的观众投出了赞成票|
|money|随便来几个价格12块534.5元20.1万|随便来几个价格十二块五 三十四点五元 二十点一万|
|percentage|明天有62的概率降雨|明天有百分之六十二的概率降雨|
|telephone|这是固话0421-33441122<br>这是手机+86 18544139121|这是固话零四二一三三四四一一二二<br>这是手机八六一八五四四一三九一二一|
acknowledgement: the NSW normalization codes are based on [Zhiyang Zhou's work here](https://github.com/Joee1995/chn_text_norm.git)
1. punctuation removal
For Chinese, it removes punctuation list collected in [Zhon](https://github.com/tsroten/zhon) project, containing
* non-stop puncs
```
'"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
```
* stop puncs
```
'!?。。'
```
For English, it removes Python's `string.punctuation`
1. multilingual English word upper/lower case conversion
since ASR/TTS lexicons usually unify English entries to uppercase or lowercase, the TN module should adapt with lexicon accordingly.
## Supported text format
1. plain text, preferably one sentence per line(most common case in ASR processing).
```
今天早饭吃了没
没吃回家吃去吧
...
```
plain text is default format.
2. Kaldi's transcription format
```
KALDI_KEY_UTT001 今天早饭吃了没
KALDI_KEY_UTT002 没吃回家吃去吧
...
```
TN will skip first column key section, normalize latter transcription text
pass `--has_key` option to switch to kaldi format.
_note: All input text should be UTF-8 encoded._
## Run examples
* TN (python)
make sure you have **python3**, python2.X won't work correctly.
`sh run.sh` in `TN` dir, and compare raw text and normalized text.
* ITN (thrax)
make sure you have **thrax** installed, and your PATH should be able to find thrax binaries.
`sh run.sh` in `ITN` dir. check Makefile for grammar dependency.
## possible future work
Since TN is a typical "done is better than perfect" module in context of ASR, and the current state is sufficient for my purpose, I probably won't update this repo frequently.
there are indeed something that needs to be improved:
* For TN, NSW normalizers in TN dir are based on regular expression, I've found some unintended matches, those pattern regexps need to be refined for more precise TN coverage.
* For ITN, extend those thrax rewriting grammars to cover more scenarios.
* Further more, nowadays commercial systems start to introduce RNN-like models into TN, and a mix of (rule-based & model-based) system is state-of-the-art. More readings about this, look for Richard Sproat and KyleGorman's work at Google.
END