diff --git a/doc/images/jieba_tags.png b/doc/images/jieba_tags.png new file mode 100644 index 000000000..a74a4ac88 Binary files /dev/null and b/doc/images/jieba_tags.png differ diff --git a/doc/src/asr_postprocess.md b/doc/src/asr_text_backend.md similarity index 98% rename from doc/src/asr_postprocess.md rename to doc/src/asr_text_backend.md index 0c84181d1..879e56f8a 100644 --- a/doc/src/asr_postprocess.md +++ b/doc/src/asr_text_backend.md @@ -1,4 +1,4 @@ -# ASR PostProcess +# ASR Text Backend 1. [Text Segmentation](text_front_end#text segmentation) 2. Text Corrector @@ -98,4 +98,4 @@ ## Text Filter -* 敏感词(黄暴、涉政、违法违禁等) +* 敏感词(黄暴、涉政、违法违禁等) \ No newline at end of file diff --git a/doc/src/benchmark.md b/doc/src/benchmark.md index 1f78223cb..f3af25552 100644 --- a/doc/src/benchmark.md +++ b/doc/src/benchmark.md @@ -4,7 +4,7 @@ We compare the training time with 1, 2, 4, 8 Tesla V100 GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). And it shows that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars. -
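+The acceleration rate is the single-GPU training time divided by the N-GPU training time, so dividing the rate by N gives the parallel efficiency. A quick check of the 8-GPU entry from the table below:
+
+```python
+# Acceleration rate = T(1 GPU) / T(N GPUs); rate / N = parallel efficiency.
+n_gpus, rate = 8, 6.95
+print(f"{n_gpus} GPUs: {rate}x speedup, {rate / n_gpus:.0%} parallel efficiency")
+# -> 8 GPUs: 6.95x speedup, 87% parallel efficiency
+```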
+
 
 | # of GPU | Acceleration Rate |
 | -------- | --------------: |
@@ -14,3 +14,4 @@ We compare the training time with 1, 2, 4, 8 Tesla V100 GPUs (with a subset of L
 | 8 | 6.95 X |
 
 `utils/profile.sh` provides such a demo profiling tool, you can change it as need.
+
diff --git a/doc/src/chinese_syllable.md b/doc/src/chinese_syllable.md
index 7ccfe3dae..676ecb531 100644
--- a/doc/src/chinese_syllable.md
+++ b/doc/src/chinese_syllable.md
@@ -13,8 +13,6 @@
 
 There are a total of 410 common pinyin syllables.
 
-
-
 * [Rare syllable](https://resources.allsetlearning.com/chinese/pronunciation/Rare_syllable)
 * [Chinese Pronunciation: The Complete Guide for Beginner](https://www.digmandarin.com/chinese-pronunciation-guide.html)
@@ -50,4 +48,4 @@
 ## Zhuyin
 
 * [Bopomofo](https://en.wikipedia.org/wiki/Bopomofo)
-* [Zhuyin table](https://en.wikipedia.org/wiki/Zhuyin_table)
+* [Zhuyin table](https://en.wikipedia.org/wiki/Zhuyin_table)
\ No newline at end of file
diff --git a/doc/src/dataset.md b/doc/src/dataset.md
index 231773a9b..d70d0e0d2 100644
--- a/doc/src/dataset.md
+++ b/doc/src/dataset.md
@@ -13,3 +13,9 @@
 * [Tatoeba](https://tatoeba.org/cmn)
 
   **Tatoeba is a collection of sentences and translations.** It's collaborative, open, free and even addictive. An open data initiative aimed at translation and speech recognition.
+
+
+
+### ASR Noise
+
+* [asr-noises](https://github.com/speechio/asr-noises)
\ No newline at end of file
diff --git a/doc/src/ngram_lm.md b/doc/src/ngram_lm.md
index 119a3b21c..07aa5411c 100644
--- a/doc/src/ngram_lm.md
+++ b/doc/src/ngram_lm.md
@@ -83,4 +83,4 @@ Please notice that the released language models only contain Chinese simplified
 
   ```
   build/bin/build_binary ./result/people2014corpus_words.arps ./result/people2014corpus_words.klm
-  ```
+  ```
\ No newline at end of file
diff --git a/doc/src/tools.md b/doc/src/tools.md
new file mode 100644
index 000000000..4ec09f6a2
--- /dev/null
+++ b/doc/src/tools.md
@@ -0,0 +1,4 @@
+# Useful Tools
+
+* [Regex visualization and common regular expressions](https://wangwl.net/static/projects/visualRegex/#)
+
diff --git a/doc/src/text_front_end.md b/doc/src/tts_text_front_end.md
similarity index 69%
rename from doc/src/text_front_end.md
rename to doc/src/tts_text_front_end.md
index 64d5cdb0f..6eb9ae5d9 100644
--- a/doc/src/text_front_end.md
+++ b/doc/src/tts_text_front_end.md
@@ -13,6 +13,37 @@ There are various libraries including some of the most popular ones like NLTK, S
 
 ## Text Normalization(文本正则)
 
+The **basic preprocessing steps** of English NLP include data cleaning, stemming/lemmatization, tokenization and stop-word removal. **Not all of these steps are necessary for Chinese text data!**
+
+### Lexicon Normalization
+
+Chinese has a concept similar to stems: radicals. **Radicals are basically the building blocks of Chinese characters.** All Chinese characters are made up of a finite number of components which are put together in different orders and combinations; the radical is usually the leftmost part of the character. There are around 200 radicals in Chinese, and they are used to index and categorize characters.
+
+Therefore, procedures like stemming and lemmatization are not useful for Chinese text data, because separating the radicals would **change the word’s meaning entirely**.
+
+### Tokenization
+
+**Tokenizing breaks up text data into shorter pre-set strings**, which help build context and meaning for the machine learning model.
+
+![jieba part-of-speech tags](../images/jieba_tags.png)
+
+These “tags” label the part of speech. There are 24 part-of-speech tags and 4 proper-name category labels in the `jieba` package’s existing dictionary.
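+
+For example, `jieba` can segment a sentence and attach a part-of-speech tag to every token in one pass. A minimal sketch, using the sample sentence from jieba’s own README:
+
+```python
+# pip install jieba
+import jieba.posseg as pseg
+
+# Each token is yielded together with its part-of-speech tag,
+# e.g. 我/r (pronoun), 爱/v (verb), 北京/ns and 天安门/ns (place names).
+for word, flag in pseg.cut("我爱北京天安门"):
+    print(f"{word}/{flag}")
+```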
+
+### Stop Words
+
+In NLP, **stop words are “meaningless” words** that make the data too noisy or ambiguous.
+
+Instead of manually removing them, you could import the `stopwordsiso` package for a full list of Chinese stop words. More information can be found [here](https://pypi.org/project/stopwordsiso/). With it, we can easily filter stop words out of large text data:
+
+```python
+!pip install stopwordsiso
+from stopwordsiso import stopwords
+
+zh_stops = stopwords(["zh"])  # the full Chinese stop-word list
+# Drop stop words from a token list produced by any segmenter.
+tokens = [t for t in ["我", "喜欢", "的", "音乐"] if t not in zh_stops]
+```
+
 文本正则化
 文本正则化主要是讲非标准词(NSW)进行转化，比如：
 数字、电话号码: 10086 -> 一千零八十六/幺零零八六
@@ -25,6 +56,14 @@ There are various libraries including some of the most popular ones like NLTK, S
 
 * https://github.com/speechio/chinese_text_normalization
 
+* [vinorm](https://github.com/NoahDrisort/vinorm) ([cpp version](https://github.com/NoahDrisort/vinorm_cpp_version))
+
+  A Python package for text normalization, used in the front end of text-to-speech research.
+
+* https://github.com/candlewill/CNTN
+
+  A ChiNese Text Normalization (CNTN) tool for text-to-speech systems, based on [sparrowhawk](https://github.com/google/sparrowhawk).
+
 
 ## Word Segmentation(分词)
 
 * https://github.com/thunlp/THULAC-Python
 * https://github.com/fxsjy/jieba
 * CRF++
+* https://github.com/isnowfy/snownlp
 
 ### MMSEG
 * [MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm](http://technology.chtsai.org/mmseg/)
 
@@ -101,8 +141,7 @@
 LP -> LO -> L1(#1) -> L2(#2) -> L3(#3) -> L4(#4) -> L5 -> L6 -> L7
 
 常用方法使用的是级联CRF，首先预测如果是PW，再继续预测是否是PPH，再预测是否是IPH
 
-
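+A minimal sketch of this cascaded-CRF idea with `sklearn-crfsuite` (a first CRF predicts PW boundaries, a second CRF consumes those predictions as features and tags PPH, and IPH would follow the same pattern). The features and labels below are toy values chosen only to show the data shapes, not this repo's implementation:
+
+```python
+# pip install sklearn-crfsuite
+import sklearn_crfsuite
+
+def features(words, pw_tags=None):
+    # Toy per-token features; a real system would add POS tags, word
+    # lengths, context windows, embeddings, etc.
+    feats = []
+    for i, w in enumerate(words):
+        f = {"word": w, "first": i == 0, "last": i == len(words) - 1}
+        if pw_tags is not None:
+            f["pw"] = pw_tags[i]  # the cascade: stage-1 output feeds stage 2
+        feats.append(f)
+    return feats
+
+# One segmented sentence with made-up boundary labels ("B" = boundary after the token).
+X_words = [["今天", "天气", "很", "好"]]
+y_pw = [["B", "B", "O", "B"]]
+y_pph = [["O", "B", "O", "B"]]
+
+crf_pw = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
+crf_pw.fit([features(s) for s in X_words], y_pw)
+
+# Stage 2 sees the words plus stage 1's predicted PW tags.
+pw_pred = crf_pw.predict([features(s) for s in X_words])
+crf_pph = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
+crf_pph.fit([features(s, p) for s, p in zip(X_words, pw_pred)], y_pph)
+```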
- + 论文: 2015 .Ding Et al. - Automatic Prosody Prediction For Chinese Speech Synthesis Using BLSTM-RNN and Embedding Features @@ -148,3 +187,5 @@ TN: 基于规则的方法 ## Reference * [Text Front End](https://slyne.github.io/%E5%85%AC%E5%BC%80%E8%AF%BE/2020/10/03/TTS1/) +* [Chinese Natural Language (Pre)processing: An Introduction](https://towardsdatascience.com/chinese-natural-language-pre-processing-an-introduction-995d16c2705f) +* [Beginner’s Guide to Sentiment Analysis for Simplified Chinese using SnowNLP](https://towardsdatascience.com/beginners-guide-to-sentiment-analysis-for-simplified-chinese-using-snownlp-ce88a8407efb) \ No newline at end of file