Xinghai Sun 5b6bbe9d18 Merge branch 'develop' into bug_fix		7 years ago
README.md	Re-organize folder structure and hierarchy for DS2.	7 years ago
_init_paths.py	Bug fix and refine cloud training for DS2.	7 years ago
pcloud_submit.sh	fix bugs for model.py and demo_server.py.	7 years ago
pcloud_train.sh	Merge branch 'develop' into bug_fix	7 years ago
pcloud_upload_data.sh	fix bugs for model.py and demo_server.py.	7 years ago
split_data.py	Seperate data uploading from job summission for DS2 cloud training and add support for multiple shards uploading.	7 years ago
upload_data.py	Update DS2 cloud training according to review comments.	7 years ago

Train DeepSpeech2 on PaddleCloud

Note: Please make sure that the PaddleCloud Client has been installed and that the current directory is deep_speech_2/cloud/.

Step 1: Upload Data

Given several input manifests, pcloud_upload_data.sh packs and uploads all the audio files they list to the PaddleCloud filesystem, and generates corresponding manifest files with updated cloud paths.

Please modify the following arguments in pcloud_upload_data.sh:

IN_MANIFESTS: Paths (in the local filesystem) of the manifest files that list the audio files to be uploaded. Multiple paths can be joined with a whitespace delimiter.
OUT_MANIFESTS: Paths (in the local filesystem) to write the updated output manifest files to. Multiple paths can be joined with a whitespace delimiter. The values of audio_filepath in the output manifests are replaced with cloud filesystem paths.
CLOUD_DATA_DIR: Directory (in the PaddleCloud filesystem) to upload the data to. Don't forget to replace USERNAME in the default directory, and make sure that you have write permission for it.
NUM_SHARDS: Number of data shards / parts (as tar files) to generate when packing and uploading data. A smaller NUM_SHARDS requires more temporary local disk space for packing data.
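As an illustration of how these four arguments fit together, here is a minimal sketch of the settings. Every path and value is a hypothetical placeholder rather than a default of the real script, and the count_items helper and the final check are additions that assume each input manifest has exactly one corresponding output manifest.

```shell
# Hypothetical values; replace every path and USERNAME with your own.
IN_MANIFESTS="data/manifest.train data/manifest.dev"
OUT_MANIFESTS="cloud.manifest.train cloud.manifest.dev"
CLOUD_DATA_DIR="/path/on/cloud/USERNAME/deepspeech2/data"
NUM_SHARDS=50

# Count the whitespace-separated entries in a string (hypothetical helper).
count_items() {
    # Intentionally unquoted so the shell splits the string on whitespace.
    set -- $1
    echo "$#"
}

# Sanity check (assumption: inputs and outputs are paired one-to-one).
if [ "$(count_items "$IN_MANIFESTS")" -ne "$(count_items "$OUT_MANIFESTS")" ]; then
    echo "IN_MANIFESTS and OUT_MANIFESTS should list the same number of paths" >&2
fi
```

The check only warns, so a mismatch can be caught before any data is packed.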

By running:

sh pcloud_upload_data.sh

all the audio files will be uploaded to the PaddleCloud filesystem, and you will get the modified manifest files at the paths given in OUT_MANIFESTS.
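One way to confirm that an output manifest really points at the cloud filesystem is to count the entries whose audio_filepath does not start with CLOUD_DATA_DIR. The sketch below is not part of the provided scripts; count_non_cloud_paths is a hypothetical helper, and it assumes each manifest line is a JSON object containing an "audio_filepath" key.

```shell
# Print how many non-empty manifest lines have an audio_filepath that does
# NOT start with the given prefix. 0 means every entry looks updated.
# Usage: count_non_cloud_paths MANIFEST_PATH CLOUD_DIR_PREFIX
count_non_cloud_paths() {
    awk -v prefix="$2" '
        NF == 0 { next }    # skip blank lines
        {
            # Find the opening quote of the audio_filepath value.
            if (match($0, /"audio_filepath"[ \t]*:[ \t]*"/)) {
                value = substr($0, RSTART + RLENGTH)
                # index() == 1 means the value begins with the prefix.
                if (index(value, prefix) != 1) bad++
            } else {
                bad++       # no audio_filepath key on this line
            }
        }
        END { print bad + 0 }
    ' "$1"
}
```

For example, `count_non_cloud_paths cloud.manifest.train "$CLOUD_DATA_DIR"` (with a hypothetical manifest name) should print 0 if every path was rewritten.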

You need to take this step only once, the very first time you do cloud training. After that, the data persists on the cloud filesystem and can be reused for further job submissions.

Step 2: Configure Training

Configure the following cloud training arguments in pcloud_submit.sh:

TRAIN_MANIFEST: Manifest filepath (in the local filesystem) for training. Note that the audio_filepath values in it should be cloud filesystem paths, like those generated by pcloud_upload_data.sh.
DEV_MANIFEST: Manifest filepath (in the local filesystem) for validation.
CLOUD_MODEL_DIR: Directory (in the PaddleCloud filesystem) to save the model parameters (checkpoints) to. Don't forget to replace USERNAME in the default directory, and make sure that you have write permission for it.
BATCH_SIZE: Training batch size for a single node.
NUM_GPU: Number of GPUs allocated for a single node.
NUM_NODE: Number of nodes (machines) allocated for this job.
IS_LOCAL: Set to False to enable the parameter server when using multiple nodes.
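The arguments above might be filled in along the following lines. This is only a sketch: all values and paths are hypothetical placeholders, and check_node_config is an added helper (not part of pcloud_submit.sh) that encodes the rule that multiple nodes require IS_LOCAL to be False.

```shell
# Hypothetical values; replace paths and USERNAME with your own.
TRAIN_MANIFEST="cloud.manifest.train"
DEV_MANIFEST="cloud.manifest.dev"
CLOUD_MODEL_DIR="/path/on/cloud/USERNAME/deepspeech2/model"
BATCH_SIZE=32     # per node
NUM_GPU=8         # per node
NUM_NODE=2
IS_LOCAL="False"  # must be False when NUM_NODE > 1

# Return 0 when NUM_NODE and IS_LOCAL are consistent, 1 otherwise.
check_node_config() {
    num_node="$1"
    is_local="$2"
    if [ "$num_node" -gt 1 ] && [ "$is_local" != "False" ]; then
        echo "NUM_NODE=$num_node requires IS_LOCAL=False" >&2
        return 1
    fi
    return 0
}

check_node_config "$NUM_NODE" "$IS_LOCAL"
```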

Configure other training hyperparameters in pcloud_train.sh as you wish, just as you would for local training.

By running:

sh pcloud_submit.sh

you submit a training job to PaddleCloud, and you will see the job name once the submission is done.

Step 3: Get Job Logs

Run this to list all the jobs you have submitted, as well as their running status:

paddlecloud get jobs

Run this to print the logs of the corresponding job:

paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
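If you fetch logs often, the command above can be wrapped in a small function that refuses to run without a job name. This is a convenience sketch only: fetch_job_logs and the PADDLECLOUD_BIN override are hypothetical names, and the only PaddleCloud invocation used is the `logs -n` form shown above.

```shell
# Binary to invoke; overridable for dry runs (hypothetical knob).
PADDLECLOUD_BIN="${PADDLECLOUD_BIN:-paddlecloud}"

# Print a job's logs, passing NUM_LINES (default 10000) to -n.
# Usage: fetch_job_logs JOB_NAME [NUM_LINES]
fetch_job_logs() {
    job_name="$1"
    num_lines="${2:-10000}"
    if [ -z "$job_name" ]; then
        echo "usage: fetch_job_logs JOB_NAME [NUM_LINES]" >&2
        return 2
    fi
    "$PADDLECLOUD_BIN" logs -n "$num_lines" "$job_name"
}
```

Setting PADDLECLOUD_BIN=echo before calling the function prints the command arguments instead of contacting the cluster.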

More Help

For more information about the usage of PaddleCloud, please refer to PaddleCloud Usage.