# Run DS2 on PaddleCloud

>Note:
>Make sure the [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and the current directory is `models/deep_speech_2/cloud/`.

## Step-1 Configure data set

Configure your input data and output path in `pcloud_submit.sh`:

- `TRAIN_MANIFEST`: Absolute path of the training data manifest file in the local filesystem. This file has the following format:

```
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text": "mister quilter is the ..."}
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text": "nor is mister ..."}
```

- `TEST_MANIFEST`: Absolute path of the test data manifest file in the local filesystem. This file has the same format as `TRAIN_MANIFEST`.
- `VOCAB_FILE`: Absolute path of the vocabulary file in the local filesystem.
- `MEAN_STD_FILE`: Absolute path of the normalizer's statistics file in the local filesystem.
- `CLOUD_DATA_DIR`: Absolute path in the PaddleCloud filesystem. We will upload the local training data to this directory.
- `CLOUD_MODEL_DIR`: Absolute path in the PaddleCloud filesystem. The PaddleCloud trainer will save the model to this directory.

>Note: Upload will be skipped if the target file already exists in `CLOUD_DATA_DIR`.

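
Before submitting, it can help to sanity-check a manifest locally. The following is a minimal sketch, not part of the DS2 code base: the helper `check_manifest_lines` is hypothetical, and it assumes that every line is a JSON object with the three keys seen in the sample above (`audio_filepath`, `duration`, `text`) and that `duration` is a positive number.

```python
import json

# Assumption: the keys shown in the sample manifest are the required ones.
REQUIRED_KEYS = {"audio_filepath", "duration", "text"}


def check_manifest_lines(lines):
    """Parse manifest lines (one JSON object per line) and return the entries.

    Raises ValueError with the offending line number on the first bad line.
    """
    entries = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {lineno}: not valid JSON") from err
        if not isinstance(entry, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        duration = entry["duration"]
        # Assumption: durations are positive numbers (seconds).
        if not isinstance(duration, (int, float)) or duration <= 0:
            raise ValueError(f"line {lineno}: bad duration {duration!r}")
        entries.append(entry)
    return entries


# The two sample lines from the manifest format shown above.
sample = [
    '{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text": "mister quilter is the ..."}',
    '{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text": "nor is mister ..."}',
]

entries = check_manifest_lines(sample)
print(len(entries), "entries parsed")
```

To check a real file, pass an open file object instead of the list, for example `check_manifest_lines(open(path))`, where `path` is the value you set for `TRAIN_MANIFEST` or `TEST_MANIFEST`.
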
## Step-2 Configure computation resource

Configure the computation resources in `pcloud_submit.sh`:

```
# Configure computation resource and submit job to PaddleCloud
paddlecloud submit \
-image wanghaoshuang/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu 4 \
-gpu 4 \
-memory 10Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 10Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \
${DS2_PATH}
```

For more information, please refer to [PaddleCloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).

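
To see how the pieces of the submit command fit together, here is a rough sketch that assembles the same argument list in Python and prints it without running anything. The function `build_submit_command` and the `/path/to/...` values are hypothetical placeholders; only the flags and resource values are taken from the example above.

```python
import shlex

# Resource settings copied from the pcloud_submit.sh example above.
RESOURCES = {
    "cpu": 4,
    "gpu": 4,
    "memory": "10Gi",
    "parallelism": 1,
    "pscpu": 1,
    "pservers": 1,
    "psmemory": "10Gi",
    "passes": 1,
}


def build_submit_command(job_name, cloud_data_dir, cloud_model_dir, ds2_path,
                         image="wanghaoshuang/pcloud_ds2:latest",
                         resources=RESOURCES):
    """Return the `paddlecloud submit` argument list (hypothetical helper)."""
    args = ["paddlecloud", "submit", "-image", image, "-jobname", job_name]
    for flag, value in resources.items():
        args += [f"-{flag}", str(value)]
    # The entry command runs inside the job and receives the two cloud paths.
    args += ["-entry", f"sh pcloud_train.sh {cloud_data_dir} {cloud_model_dir}"]
    args.append(ds2_path)
    return args


cmd = build_submit_command(
    job_name="deepspeech20170727130129",
    cloud_data_dir="/path/to/cloud/data",    # placeholder for CLOUD_DATA_DIR
    cloud_model_dir="/path/to/cloud/model",  # placeholder for CLOUD_MODEL_DIR
    ds2_path="/path/to/deep_speech_2",       # placeholder for DS2_PATH
)
# Print a shell-quoted version of the command for inspection only.
print(shlex.join(cmd))
```

Printing the command instead of executing it makes it easy to compare against `pcloud_submit.sh` before changing resource values such as `-gpu` or `-memory`.
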
## Step-3 Configure algorithm options

Configure the algorithm options in `pcloud_train.sh`:

```
python train.py \
--use_gpu=1 \
--trainer_count=4 \
--batch_size=256 \
--mean_std_filepath=$MEAN_STD_FILE \
--train_manifest_path='./local.train.manifest' \
--dev_manifest_path='./local.test.manifest' \
--vocab_filepath=$VOCAB_PATH \
--output_model_dir=${MODEL_PATH}
```

You can get more information about the algorithm options with the following command:

```
cd ..
python train.py --help
```

## Step-4 Submit job

```
$ sh pcloud_submit.sh
```

## Step-5 Get logs

```
$ paddlecloud logs -n 10000 deepspeech20170727130129
```

For more information, please refer to the [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#下载并配置paddlecloud) documentation, or get help with the following command:

```
paddlecloud --help
```