commit
4133a562fb
@@ -1,16 +0,0 @@
# Benchmarks

## Acceleration with Multi-GPUs

We compared training time on 1, 2, 4, and 8 Tesla V100 GPUs, using a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds. The results show **near-linear** acceleration as GPUs are added, reaching 6.95x on 8 GPUs. In the following figure, the training time (in seconds) is printed on the blue bars.

<img src="../images/multi_gpu_speedup.png" width=450>

| # of GPUs | Acceleration Rate |
| --------- | ----------------: |
| 1         |            1.00 X |
| 2         |            1.98 X |
| 4         |            3.73 X |
| 8         |            6.95 X |

`utils/profile.sh` provides a demo profiling tool for this comparison; you can change it as needed.
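
As a quick check on the near-linear claim, the speedups in the table can be converted to per-GPU parallel efficiency (speedup divided by GPU count). This is a minimal sketch that only restates the table; the function name `parallel_efficiency` is an illustrative choice and is not part of `utils/profile.sh`.

```python
# Speedups copied from the table above.
speedups = {1: 1.00, 2: 1.98, 4: 3.73, 8: 6.95}


def parallel_efficiency(speedups):
    """Return {gpu_count: speedup / gpu_count} for each measured configuration."""
    return {n: s / n for n, s in speedups.items()}


efficiency = parallel_efficiency(speedups)
for n, e in sorted(efficiency.items()):
    print(f"{n} GPU(s): efficiency {e:.3f}")
```

Efficiency stays at roughly 99% on 2 GPUs and falls to about 87% on 8 GPUs, which is what "near-linear" amounts to for these measurements.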
@@ -0,0 +1,58 @@
# [CC-CEDICT](https://cc-cedict.org/wiki/)

What is CC-CEDICT?
CC-CEDICT is a continuation of the CEDICT project.
The objective of the CEDICT project was to create an online, downloadable (as opposed to searchable-only) public-domain Chinese-English dictionary.
CEDICT was started by Paul Andrew Denisowski in October 1997.
For the most part, the project is modeled on Jim Breen's highly successful EDICT (Japanese-English dictionary) project and is intended to be a collaborative effort,
with users providing entries and corrections to the main file.

## Parse CC-CEDICT to JSON format

1. Parse to JSON

```
bash run.sh
```

2. Result

```
exp/
|-- cedict
`-- cedict.json

0 directories, 2 files
```
```
4c4bffc84e24467fe1b2ea9ba37ed6b6  exp/cedict
3adf504dacd13886f88cc9fe3b37c75d  exp/cedict.json
```
```
==> exp/cedict <==
# CC-CEDICT
# Community maintained free Chinese-English dictionary.
#
# Published by MDBG
#
# License:
# Creative Commons Attribution-ShareAlike 4.0 International License
# https://creativecommons.org/licenses/by-sa/4.0/
#
# Referenced works:

==> exp/cedict.json <==
{"traditional": "2019\u51a0\u72c0\u75c5\u6bd2\u75c5", "simplified": "2019\u51a0\u72b6\u75c5\u6bd2\u75c5", "pinyin": "er4 ling2 yi1 jiu3 guan1 zhuang4 bing4 du2 bing4", "english": "COVID-19, the coronavirus disease identified in 2019"}
{"traditional": "21\u4e09\u9ad4\u7d9c\u5408\u75c7", "simplified": "21\u4e09\u4f53\u7efc\u5408\u75c7", "pinyin": "er4 shi2 yi1 san1 ti3 zong1 he2 zheng4", "english": "trisomy"}
{"traditional": "3C", "simplified": "3C", "pinyin": "san1 C", "english": "abbr. for computers, communications, and consumer electronics"}
{"traditional": "3P", "simplified": "3P", "pinyin": "san1 P", "english": "(slang) threesome"}
{"traditional": "3Q", "simplified": "3Q", "pinyin": "san1 Q", "english": "(Internet slang) thank you (loanword)"}
{"traditional": "421", "simplified": "421", "pinyin": "si4 er4 yi1", "english": "four grandparents, two parents and an only child"}
{"traditional": "502\u81a0", "simplified": "502\u80f6", "pinyin": "wu3 ling2 er4 jiao1", "english": "cyanoacrylate glue"}
{"traditional": "88", "simplified": "88", "pinyin": "ba1 ba1", "english": "(Internet slang) bye-bye (alternative for \u62dc\u62dc[bai2 bai2])"}
{"traditional": "996", "simplified": "996", "pinyin": "jiu3 jiu3 liu4", "english": "9am-9pm, six days a week (work schedule)"}
{"traditional": "A", "simplified": "A", "pinyin": "A", "english": "(slang) (Tw) to steal"}
```
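
A minimal sketch of how one CC-CEDICT line could be turned into a record like those above. It assumes the standard CC-CEDICT entry layout (`Traditional Simplified [pinyin] /gloss 1/gloss 2/`). The `parse_line` helper and the `"; "` separator used to join multiple glosses are illustrative assumptions and are not taken from `run.sh`.

```python
import json
import re

# Assumed entry layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def parse_line(line):
    """Return a dict for one dictionary entry, or None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    match = ENTRY.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        # Assumption: multiple glosses are joined with "; ".
        "english": "; ".join(glosses.split("/")),
    }


sample = "3C 3C [san1 C] /abbr. for computers, communications, and consumer electronics/"
print(json.dumps(parse_line(sample)))
```

`json.dumps` escapes non-ASCII characters by default, which would produce `\uXXXX` sequences like those in `exp/cedict.json`.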
@@ -1,5 +0,0 @@
# Download Baker dataset

The Baker dataset has to be downloaded manually and moved to `data/`, because you must pass a CAPTCHA in a browser to download it.

Download URL: https://test.data-baker.com/#/data/index/source
@@ -0,0 +1,3 @@
# G2P

* zh - Chinese G2P
@@ -1,4 +1,4 @@
-export MAIN_ROOT=`realpath ${PWD}/../../`
+export MAIN_ROOT=`realpath ${PWD}/../../../`
 
 export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
 export LC_ALL=C
@@ -0,0 +1,3 @@
# Ngram LM

* s0 - KenLM n-gram LM
@@ -0,0 +1 @@
data/lm
@@ -1,7 +1,96 @@
# [SentencePiece Model](https://github.com/google/sentencepiece)

## Run
Train a SentencePiece (`spm`) model for the English tokenizer.

```
. path.sh
bash run.sh
```

## Results

```
data/
└── lang_char
    ├── input.bpe
    ├── input.decode
    ├── input.txt
    ├── train_unigram100.model
    ├── train_unigram100_units.txt
    └── train_unigram100.vocab

1 directory, 6 files
```
```
b5a230c26c61db5c36f34e503102f936  data/lang_char/input.bpe
ec5a9b24acc35469229e41256ceaf77d  data/lang_char/input.decode
ec5a9b24acc35469229e41256ceaf77d  data/lang_char/input.txt
124bf3fe7ce3b73b1994234c15268577  data/lang_char/train_unigram100.model
0df2488cc8eaace95eb12713facb5cf0  data/lang_char/train_unigram100_units.txt
46360cac35c751310e8e8ffd3a034cb5  data/lang_char/train_unigram100.vocab
```
```
==> data/lang_char/input.bpe <==
▁mi ster ▁quilter ▁ is ▁the ▁a p ost le ▁o f ▁the ▁mi d d le ▁c las s es ▁ and ▁we ▁ar e ▁g l a d ▁ to ▁we l c om e ▁h is ▁g o s pe l
▁ n or ▁ is ▁mi ster ▁quilter ' s ▁ma nne r ▁ l ess ▁in ter es t ing ▁tha n ▁h is ▁ma t ter
▁h e ▁ t e ll s ▁us ▁tha t ▁ at ▁ t h is ▁f es t ive ▁ s e ason ▁o f ▁the ▁ y e ar ▁w ith ▁ ch r is t m a s ▁ and ▁ro a s t ▁be e f ▁ l o om ing ▁be fore ▁us ▁ s i mile s ▁d r a w n ▁f r om ▁ e at ing ▁ and ▁it s ▁re s u l t s ▁o c c ur ▁m ost ▁re a di l y ▁ to ▁the ▁ mind
▁h e ▁ ha s ▁g r a v e ▁d o u b t s ▁w h e t h er ▁ s i r ▁f r e d er ic k ▁ l eig h to n ' s ▁w or k ▁ is ▁re all y ▁gre e k ▁a f ter ▁ all ▁ and ▁c a n ▁di s c o v er ▁in ▁it ▁b u t ▁li t t le ▁o f ▁ro ck y ▁it ha c a
▁li nne ll ' s ▁ p ic tur es ▁ar e ▁a ▁ s or t ▁o f ▁ u p ▁g u ar d s ▁ and ▁ at ▁ em ▁painting s ▁ and ▁m ason ' s ▁ e x q u is i t e ▁ i d y ll s ▁ar e ▁a s ▁ n at ion a l ▁a s ▁a ▁ j ing o ▁ p o em ▁mi ster ▁b i r k e t ▁f o ster ' s ▁ l and s c a pe s ▁ s mile ▁ at ▁on e ▁m u ch ▁in ▁the ▁ s a m e ▁w a y ▁tha t ▁mi ster ▁c ar k er ▁us e d ▁ to ▁f las h ▁h is ▁ t e e t h ▁ and ▁mi ster ▁ j o h n ▁c o ll i er ▁g ive s ▁h is ▁ s i t ter ▁a ▁ ch e er f u l ▁ s l a p ▁on ▁the ▁b a ck ▁be fore ▁h e ▁ s a y s ▁li k e ▁a ▁ s ha m p o o er ▁in ▁a ▁ tur k is h ▁b at h ▁ n e x t ▁ma n
▁it ▁ is ▁o b v i o u s l y ▁ u nne c ess ar y ▁for ▁us ▁ to ▁ p o i n t ▁o u t ▁h o w ▁ l u m i n o u s ▁the s e ▁c rit ic is m s ▁ar e ▁h o w ▁d e l ic at e ▁in ▁ e x p r ess ion
▁on ▁the ▁g e n er a l ▁ p r i n c i p l es ▁o f ▁ar t ▁mi ster ▁quilter ▁w rit es ▁w ith ▁ e qual ▁ l u c i di t y
▁painting ▁h e ▁ t e ll s ▁us ▁ is ▁o f ▁a ▁di f f er e n t ▁ qual i t y ▁ to ▁ma t h em at ic s ▁ and ▁f i nish ▁in ▁ar t ▁ is ▁a d d ing ▁m or e ▁f a c t
▁a s ▁for ▁ e t ch ing s ▁the y ▁ar e ▁o f ▁ t w o ▁ k i n d s ▁b rit is h ▁ and ▁for eig n
▁h e ▁ l a ment s ▁m ost ▁b i t ter l y ▁the ▁di v or c e ▁tha t ▁ ha s ▁be e n ▁ma d e ▁be t w e e n ▁d e c or at ive ▁ar t ▁ and ▁w ha t ▁we ▁us u all y ▁c all ▁ p ic tur es ▁ma k es ▁the ▁c u s t om ar y ▁a p pe a l ▁ to ▁the ▁ las t ▁ j u d g ment ▁ and ▁re mind s ▁us ▁tha t ▁in ▁the ▁gre at ▁d a y s ▁o f ▁ar t ▁mi c ha e l ▁a n g e l o ▁w a s ▁the ▁f ur nish ing ▁ u p h o l ster er

==> data/lang_char/input.decode <==
mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
nor is mister quilter's manner less interesting than his matter
he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind
he has grave doubts whether sir frederick leighton's work is really greek after all and can discover in it but little of rocky ithaca
linnell's pictures are a sort of up guards and at em paintings and mason's exquisite idylls are as national as a jingo poem mister birket foster's landscapes smile at one much in the same way that mister carker used to flash his teeth and mister john collier gives his sitter a cheerful slap on the back before he says like a shampooer in a turkish bath next man
it is obviously unnecessary for us to point out how luminous these criticisms are how delicate in expression
on the general principles of art mister quilter writes with equal lucidity
painting he tells us is of a different quality to mathematics and finish in art is adding more fact
as for etchings they are of two kinds british and foreign
he laments most bitterly the divorce that has been made between decorative art and what we usually call pictures makes the customary appeal to the last judgment and reminds us that in the great days of art michael angelo was the furnishing upholsterer

==> data/lang_char/input.txt <==
mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
nor is mister quilter's manner less interesting than his matter
he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind
he has grave doubts whether sir frederick leighton's work is really greek after all and can discover in it but little of rocky ithaca
linnell's pictures are a sort of up guards and at em paintings and mason's exquisite idylls are as national as a jingo poem mister birket foster's landscapes smile at one much in the same way that mister carker used to flash his teeth and mister john collier gives his sitter a cheerful slap on the back before he says like a shampooer in a turkish bath next man
it is obviously unnecessary for us to point out how luminous these criticisms are how delicate in expression
on the general principles of art mister quilter writes with equal lucidity
painting he tells us is of a different quality to mathematics and finish in art is adding more fact
as for etchings they are of two kinds british and foreign
he laments most bitterly the divorce that has been made between decorative art and what we usually call pictures makes the customary appeal to the last judgment and reminds us that in the great days of art michael angelo was the furnishing upholsterer

==> data/lang_char/train_unigram100_units.txt <==
<blank> 0
<unk> 1
' 2
a 3
all 4
and 5
ar 6
ason 7
at 8
b 9

==> data/lang_char/train_unigram100.vocab <==
<unk> 0
<s> 0
</s> 0
▁ -2.01742
e -2.7203
s -2.82989
t -2.99689
l -3.53267
n -3.84935
o -3.88229
```
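
The relation between `input.bpe` and `input.decode` can be illustrated without the SentencePiece library: the pieces are concatenated and the `▁` (U+2581) word-boundary marker becomes a space. This is a sketch that approximates decoding for the plain lowercase text shown here. The `decode_pieces` helper is hypothetical, and a real model should be decoded with SentencePiece itself.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker that SentencePiece puts at the start of a word


def decode_pieces(pieces):
    """Concatenate subword pieces and turn word-boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


# Beginning of the first line of input.bpe, split on whitespace into pieces.
bpe_line = "▁mi ster ▁quilter ▁ is ▁the ▁a p ost le"
decoded = decode_pieces(bpe_line.split())
print(decoded)
```

The identical checksums of `input.decode` and `input.txt` listed above indicate that encoding followed by decoding reproduces the input exactly.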
@@ -1,3 +0,0 @@
# Regular expression based text normalization for Chinese

For simplicity and ease of implementation, text normalization is done mainly with rules and dictionaries. Here's an example.
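
A minimal sketch of the rule-plus-dictionary idea, not the normalizer from this repository: a regular expression finds runs of ASCII digits, and a small lookup table (the hypothetical `DIGITS` dictionary) reads them out digit by digit, as is common for phone numbers. Quantities, dates, and other number readings would need further rules.

```python
import re

# Hypothetical digit dictionary; a real system would cover many more cases.
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))


def normalize_digits(text):
    """Replace each run of ASCII digits with its digit-by-digit Chinese reading."""
    return re.sub(
        r"[0-9]+",
        lambda m: "".join(DIGITS[d] for d in m.group()),
        text,
    )


print(normalize_digits("电话10086"))
```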
@@ -0,0 +1 @@
exp