You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
PaddleSpeech/cloud/README.md

82 lines
2.7 KiB

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

# Run DS2 on PaddleCloud
>Note:
>Make sure [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has be installed and current directory is `models/deep_speech_2/cloud/`
## Step-1 Configure data set
Configure your input data and output path in pcloud_submit.sh:
- `TRAIN_MANIFEST` Absolute path of train data manifest file in local file system.This file has format as bellow:
```
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text
": "mister quilter is the ..."}
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text
": "nor is mister ..."}
```
- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem. This file has format like `TRAIN_MANIFEST`.
- `VOCAB_FILE`: Absolute path of vocabulary file in local filesytem.
- `MEAN_STD_FILE`: Absolute path of normalizer's statistic file in local filesytem.
- `CLOUD_DATA_DIR:` Absolute path in PaddleCloud filesystem. We will upload local train data to this directory.
- `CLOUD_MODEL_DIR`: Absolute path in PaddleCloud filesystem. PaddleCloud trainer will save model to this directory.
>Note: Upload will be skipped if target file has existed in `CLOUD_DATA_DIR`.
## Step-2 Configure computation resource
Configure computation resource in pcloud_submit.sh:
```
# Configure computation resource and submit job to PaddleCloud
paddlecloud submit \
-image wanghaoshuang/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu 4 \
-gpu 4 \
-memory 10Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 10Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \
${DS2_PATH}
```
For more information, please refer to [PaddleCloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务)
## Step-3 Configure algorithm options
Configure algorithm options in pcloud_train.sh:
```
python train.py \
--use_gpu=1 \
--trainer_count=4 \
--batch_size=256 \
--mean_std_filepath=$MEAN_STD_FILE \
--train_manifest_path='./local.train.manifest' \
--dev_manifest_path='./local.test.manifest' \
--vocab_filepath=$VOCAB_PATH \
--output_model_dir=${MODEL_PATH}
```
You can get more information about algorithm options by follow command:
```
cd ..
python train.py --help
```
## Step-4 Submit job
```
$ sh pcloud_submit.sh
```
## Step-5 Get logs
```
$ paddlecloud logs -n 10000 deepspeech20170727130129
```
For more information, please refer to [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#下载并配置paddlecloud) or get help by follow command:
```
paddlecloud --help
```