You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
PaddleSpeech/cloud/README.md

71 lines
3.4 KiB

# Train DeepSpeech2 on PaddleCloud
>Note:
>Please make sure [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has be installed and current directory is `deep_speech_2/cloud/`
## Step 1: Upload Data
Provided with several input manifests, `pcloud_upload_data.sh` will pack and upload all the containing audio files to PaddleCloud filesystem, and also generate some corresponding manifest files with updated cloud paths.
Please modify the following arguments in `pcloud_upload_data.sh`:
- `IN_MANIFESTS` Paths (in local filesystem) of manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimeter. Lines of manifest files are in the following format:
```
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text
": "mister quilter is the ..."}
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text
": "nor is mister ..."}
```
- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimeter. The values of `audio_filepath` in the output manifests are jjjjjkknew paths in PaddleCloud filesystem.
- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to.
- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. Smaller `num_shards` requires larger temoporal local disk space for packing data.
By running:
```
sh pcloud_upload_data.sh
```
all the audio files will be uploaded to PaddleCloud filesystem, and you will get modified manifests files in `OUT_MANIFESTS`.
You have to take this step only once, when it is your first time to do the cloud training. Later on, the data is persisitent on the cloud filesystem and is reusable for multple jobs.
## Step 2: Configure Training
Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments:
- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Notice that the`audio_filepath` should be in cloud filesystem, like those generated by `pcloud_upload_data.sh`.
- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation.
- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints).
- `BATCH_SIZE`: Training batch size for a single node.
- `NUM_GPU`: Number of GPUs allocated for a single node.
- `NUM_NODE`: Number of nodes (machines) allocated for this job.
- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes.
Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as what you can do in local training.
By running:
```
sh pcloud_submit.sh
```
you submit a training job to PaddleCloud. And you will see the job name when the submission is done.
## Step 3 Get Job Logs
Run this to list all the jobs you have submitted, as well as their running status:
```
paddlecloud get jobs
```
Run this, the corresponding job's logs will be printed.
```
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
## More Help
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).