add text normalization (#620)

* add text normalization

* add space
Hui Zhang 4 years ago committed by GitHub
parent d0635c6592
commit 4853761542

Binary file not shown (new image, 46 KiB).

@@ -1,4 +1,4 @@
-# ASR PostProcess
+# ASR Text Backend
1. [Text Segmentation](text_front_end#text-segmentation)
2. Text Corrector
@@ -98,4 +98,4 @@
## Text Filter
* Sensitive words (pornography and violence, politics-related, illegal and prohibited content, etc.); a filtering sketch follows below
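A minimal sketch of such keyword-based filtering (the word list and masking policy are illustrative; production systems use curated per-category lexicons and often an Aho-Corasick automaton for speed):

```python
# Illustrative placeholder lexicon; real systems maintain per-category lists.
SENSITIVE_WORDS = ["赌博", "毒品"]

def mask_sensitive(text: str) -> str:
    # Replace each sensitive word with asterisks of the same length,
    # matching longer words first.
    for w in sorted(SENSITIVE_WORDS, key=len, reverse=True):
        text = text.replace(w, "*" * len(w))
    return text

print(mask_sensitive("网络赌博是违法的"))  # 网络**是违法的
```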

@@ -4,7 +4,7 @@
We compare the training time with 1, 2, 4, and 8 Tesla V100 GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). The results show that **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the training time (in seconds) is printed on the blue bars.
-<img src="../images/multi_gpu_speedup.png" width=450><br/>
+<img src="../images/multi_gpu_speedup.png" width=450>
| # of GPU | Acceleration Rate |
| -------- | --------------: |
@@ -14,3 +14,4 @@ We compare the training time with 1, 2, 4, 8 Tesla V100 GPUs (with a subset of L
| 8 | 6.95 X |
`utils/profile.sh` provides such a demo profiling tool; you can modify it as needed.

@@ -13,8 +13,6 @@
There are a total of 410 common pinyin syllables.
* [Rare syllable](https://resources.allsetlearning.com/chinese/pronunciation/Rare_syllable)
* [Chinese Pronunciation: The Complete Guide for Beginners](https://www.digmandarin.com/chinese-pronunciation-guide.html)
@@ -50,4 +48,4 @@
## Zhuyin
* [Bopomofo](https://en.wikipedia.org/wiki/Bopomofo)
* [Zhuyin table](https://en.wikipedia.org/wiki/Zhuyin_table)

@@ -13,3 +13,9 @@
* [Tatoeba](https://tatoeba.org/cmn)
**Tatoeba is a collection of sentences and translations.** It is collaborative, open, free, and even addictive: an open-data initiative aimed at translation and speech recognition.
### ASR Noise
* [asr-noises](https://github.com/speechio/asr-noises)

@@ -83,4 +83,4 @@ Please notice that the released language models only contain Chinese simplified
```
build/bin/build_binary ./result/people2014corpus_words.arps ./result/people2014corpus_words.klm
```
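Once built, the binary model can be loaded for scoring, e.g. with the `kenlm` Python bindings; a minimal sketch (assuming the bindings are installed; the test sentence is arbitrary and must be pre-segmented with spaces, since the model is word-based):

```python
import kenlm

# Load the binary LM produced by build_binary above.
model = kenlm.Model('./result/people2014corpus_words.klm')

# Score a space-separated, word-segmented sentence (log10 probability).
print(model.score('今天 天气 不错', bos=True, eos=True))
```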

@@ -0,0 +1,4 @@
# Useful Tools
* [Regex visualization and common regular expressions](https://wangwl.net/static/projects/visualRegex/#)

@@ -13,6 +13,37 @@ There are various libraries including some of the most popular ones like NLTK, S
## Text Normalization(文本正则)
The **basic preprocessing steps** in English NLP include data cleaning, stemming/lemmatization, tokenization, and stop-word removal. **Not all of these steps are necessary for Chinese text data!**
### Lexicon Normalization
There's a concept similar to stems in Chinese, called radicals. **Radicals are basically the building blocks of Chinese characters.** All Chinese characters are made up of a finite number of components which are put together in different orders and combinations. Radicals are usually the leftmost part of a character. There are around 200 radicals in Chinese, and they are used to index and categorize characters.
Therefore, procedures like stemming and lemmatization are not useful for Chinese text data, because separating the radicals would **change the word's meaning entirely**.
### Tokenization
**Tokenizing breaks up text data into shorter, pre-set strings**, which help build context and meaning for the machine learning model. For Chinese, tokenization is essentially word segmentation; segmenters such as **`jieba`** can also attach "tags" that label each token's part of speech. There are 24 part-of-speech tags and 4 proper-name category labels in the **`jieba`** package's existing dictionary.
<img src="../images/jieba_tags.png" width=650>
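As a quick illustration, a minimal sketch of segmentation and part-of-speech tagging with `jieba` (the sample sentence is arbitrary):

```python
!pip install jieba
import jieba
import jieba.posseg as pseg

text = "今天天气不错"  # "The weather is nice today"

# Plain word segmentation.
print(jieba.lcut(text))  # ['今天', '天气', '不错']

# Segmentation with part-of-speech tags (e.g. t: time word, n: noun, a: adjective).
for word, tag in pseg.cut(text):
    print(word, tag)
```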
### Stop Words
In NLP, **stop words are "meaningless" words** that make the data too noisy or ambiguous.
Instead of removing them manually, you can import the **`stopwordsiso`** package for a full list of Chinese stop words. More information can be found [here](https://pypi.org/project/stopwordsiso/). With this, we can easily write code to filter out stop words in large text data (see the filtering sketch after the snippet below).
```python
!pip install stopwordsiso
from stopwordsiso import stopwords

zh_stops = stopwords(["zh"])  # returns a set of Chinese stop words
```
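Putting the two packages together, a minimal sketch of stop-word filtering on `jieba`-segmented text (the helper name `remove_stopwords` and the sample sentence are illustrative):

```python
import jieba
from stopwordsiso import stopwords

ZH_STOPS = stopwords(["zh"])  # set of Chinese stop words

def remove_stopwords(text: str) -> list:
    # Segment with jieba, then drop stop words and whitespace-only tokens.
    return [tok for tok in jieba.lcut(text)
            if tok not in ZH_STOPS and tok.strip()]

print(remove_stopwords("我们今天的天气真的很不错"))
```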
Text normalization mainly converts non-standard words (NSW) into standard form, for example:
Numbers and phone numbers: 10086 -> 一千零八十六 / 幺零零八六
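The phone-number reading above is produced digit by digit; a minimal rule-based sketch (the mapping table and function name are illustrative, not taken from the libraries below; the quantity reading 一千零八十六 needs a full number verbalizer, which those libraries provide):

```python
# Verbatim (digit-by-digit) reading as used for phone numbers: 1 is read 幺, not 一.
DIGIT_READINGS = {
    "0": "零", "1": "幺", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}

def read_digits(number: str) -> str:
    """Read a digit string verbatim, e.g. '10086' -> '幺零零八六'."""
    return "".join(DIGIT_READINGS[d] for d in number)

print(read_digits("10086"))  # 幺零零八六
```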
@@ -25,6 +56,14 @@ There are various libraries including some of the most popular ones like NLTK, S
* https://github.com/speechio/chinese_text_normalization
* [vinorm](https://github.com/NoahDrisort/vinorm) [cpp_verion](https://github.com/NoahDrisort/vinorm_cpp_version)
Python package for text normalization, used in the front end of text-to-speech research.
* https://github.com/candlewill/CNTN
This is a ChiNese Text Normalization (CNTN) tool for text-to-speech systems, based on [sparrowhawk](https://github.com/google/sparrowhawk).
## Word Segmentation(分词)
@@ -42,6 +81,7 @@ There are various libraries including some of the most popular ones like NLTK, S
* https://github.com/thunlp/THULAC-Python
* https://github.com/fxsjy/jieba
* CRF++
* https://github.com/isnowfy/snownlp
### MMSEG
* [MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm](http://technology.chtsai.org/mmseg/)
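MMSEG builds on maximum matching; a compact sketch of plain forward maximum matching (the toy dictionary is illustrative, and the example deliberately shows the greedy ambiguity that MMSEG's extra rules are designed to resolve):

```python
# Forward maximum matching: at each position, take the longest dictionary word.
DICT = {"研究", "研究生", "生命", "的", "起源"}  # toy dictionary
MAX_LEN = max(len(w) for w in DICT)

def fmm_segment(text: str) -> list:
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in DICT or j == i + 1:  # fall back to a single char
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Greedy FMM picks 研究生 here, although 研究/生命 is the intended reading.
print(fmm_segment("研究生命的起源"))  # ['研究生', '命', '的', '起源']
```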
@@ -101,8 +141,7 @@ LP -> LO -> L1(#1) -> L2(#2) -> L3(#3) -> L4(#4) -> L5 -> L6 -> L7
A common approach is a cascaded CRF: first predict whether a boundary is a PW; if it is, further predict whether it is a PPH, and then whether it is an IPH.
-<img src="../images/prosody.jpeg" width=450><br/>
+<img src="../images/prosody.jpeg" width=450>
Paper: Ding et al., 2015 - Automatic Prosody Prediction For Chinese Speech Synthesis Using BLSTM-RNN and Embedding Features
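A sketch of the cascaded decision logic described above (the three per-boundary predictors are hypothetical placeholders standing in for the trained CRF stages):

```python
from typing import Callable, Dict, List

def cascade_prosody(boundaries: List[Dict],
                    is_pw: Callable[[Dict], bool],
                    is_pph: Callable[[Dict], bool],
                    is_iph: Callable[[Dict], bool]) -> List[str]:
    """Label each boundary: test PPH only if PW fired, IPH only if PPH fired."""
    labels = []
    for b in boundaries:
        if not is_pw(b):
            labels.append("O")      # no prosodic boundary
        elif not is_pph(b):
            labels.append("PW")     # prosodic word boundary only
        elif not is_iph(b):
            labels.append("PPH")    # prosodic phrase boundary
        else:
            labels.append("IPH")    # intonational phrase boundary
    return labels
```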
@@ -148,3 +187,5 @@ TN: rule-based methods
## Reference
* [Text Front End](https://slyne.github.io/%E5%85%AC%E5%BC%80%E8%AF%BE/2020/10/03/TTS1/)
* [Chinese Natural Language (Pre)processing: An Introduction](https://towardsdatascience.com/chinese-natural-language-pre-processing-an-introduction-995d16c2705f)
* [Beginner's Guide to Sentiment Analysis for Simplified Chinese using SnowNLP](https://towardsdatascience.com/beginners-guide-to-sentiment-analysis-for-simplified-chinese-using-snownlp-ce88a8407efb)