# Run DS2 on PaddleCloud

>Note:
>Make sure the [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and that the current directory is `models/deep_speech_2/cloud/`.

## Step-1  Configure data set

Configure your input data and output paths in `pcloud_submit.sh`:

- `TRAIN_MANIFEST`: Absolute path of the training data manifest file in the local filesystem. This file has the following format, with one JSON object per line:

```
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text": "mister quilter is the ..."}
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text": "nor is mister ..."}
```

- `TEST_MANIFEST`: Absolute path of the test data manifest file in the local filesystem. This file has the same format as `TRAIN_MANIFEST`.
- `VOCAB_FILE`: Absolute path of the vocabulary file in the local filesystem.
- `MEAN_STD_FILE`: Absolute path of the normalizer's statistics file in the local filesystem.
- `CLOUD_DATA_DIR`: Absolute path in the PaddleCloud filesystem. Local training data will be uploaded to this directory.
- `CLOUD_MODEL_DIR`: Absolute path in the PaddleCloud filesystem. The PaddleCloud trainer will save the model to this directory.

>Note: The upload is skipped if the target file already exists in `CLOUD_DATA_DIR`.
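
Before uploading, it can help to confirm that a manifest really follows the format shown above. The snippet below is a minimal sketch, not part of this repository: `load_manifest` and `REQUIRED_KEYS` are hypothetical names, and the only assumption is that each line is a JSON object with the three keys from the example.

```python
import json

# Keys that appear in the manifest example above.
REQUIRED_KEYS = {"audio_filepath", "duration", "text"}


def load_manifest(lines):
    """Parse manifest lines (one JSON object per line) and check required keys."""
    entries = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        entries.append(entry)
    return entries


# In practice the lines would come from open(TRAIN_MANIFEST); a literal
# sample is used here so the sketch runs on its own.
sample = [
    '{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text": "mister quilter is the ..."}',
    '{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text": "nor is mister ..."}',
]
entries = load_manifest(sample)
total_seconds = sum(e["duration"] for e in entries)
print(len(entries), "utterances,", round(total_seconds, 3), "seconds of audio")
```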

## Step-2  Configure computation resource

Configure the computation resources in `pcloud_submit.sh`:

```
# Configure computation resource and submit job to PaddleCloud
 paddlecloud submit \
 -image wanghaoshuang/pcloud_ds2:latest \
 -jobname ${JOB_NAME} \
 -cpu 4 \
 -gpu 4 \
 -memory 10Gi \
 -parallelism 1 \
 -pscpu 1 \
 -pservers 1 \
 -psmemory 10Gi \
 -passes 1 \
 -entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \
 ${DS2_PATH}
```
For more information, please refer to the [PaddleCloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务) documentation.
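
To make the structure of the submit command easier to see, the following sketch assembles the same argument list in Python. It only uses the flags that appear in the block above. `build_submit_command` is a hypothetical helper, and the job name, package path, and cloud directories are placeholder values, not settings from this repository. The command is printed rather than executed.

```python
import shlex


def build_submit_command(job_name, package_path, entry, resources):
    """Return the `paddlecloud submit` argument list for the given settings."""
    cmd = ["paddlecloud", "submit"]
    for flag, value in resources.items():
        cmd += [f"-{flag}", str(value)]
    # The package path is kept last, as in the shell snippet above.
    cmd += ["-jobname", job_name, "-entry", entry, package_path]
    return cmd


# Resource values copied from the example above.
resources = {
    "image": "wanghaoshuang/pcloud_ds2:latest",
    "cpu": 4,
    "gpu": 4,
    "memory": "10Gi",
    "parallelism": 1,
    "pscpu": 1,
    "pservers": 1,
    "psmemory": "10Gi",
    "passes": 1,
}

cmd = build_submit_command(
    job_name="deepspeech_demo",                            # placeholder
    package_path="/path/to/deep_speech_2",                 # placeholder for ${DS2_PATH}
    entry="sh pcloud_train.sh /cloud/data /cloud/model",   # placeholder directories
    resources=resources,
)
print(shlex.join(cmd))
```

Keeping the resource settings in one dictionary makes it easy to change, for example, `gpu` and `memory` together before a submission.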

## Step-3  Configure algorithm options
Configure the algorithm options in `pcloud_train.sh`:
```
python train.py \
--use_gpu=1 \
--trainer_count=4 \
--batch_size=256 \
--mean_std_filepath=$MEAN_STD_FILE \
--train_manifest_path='./local.train.manifest' \
--dev_manifest_path='./local.test.manifest' \
--vocab_filepath=$VOCAB_PATH \
--output_model_dir=${MODEL_PATH}
```
You can get more information about the algorithm options with the following command:
```
cd ..
python train.py --help
```
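
As an illustration of how these options reach the training script, here is a hypothetical `argparse` parser that mirrors only the flags used above. The types and defaults are assumptions; the real `train.py` may declare them differently, and `python train.py --help` remains the authoritative list.

```python
import argparse


def make_parser():
    """Hypothetical mirror of the options passed in pcloud_train.sh."""
    parser = argparse.ArgumentParser(description="Sketch of DS2 training options.")
    # Assumed meaning: 1 trains on GPU, 0 on CPU.
    parser.add_argument("--use_gpu", type=int, default=1)
    parser.add_argument("--trainer_count", type=int, default=4)
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--mean_std_filepath", type=str)
    parser.add_argument("--train_manifest_path", type=str)
    parser.add_argument("--dev_manifest_path", type=str)
    parser.add_argument("--vocab_filepath", type=str)
    parser.add_argument("--output_model_dir", type=str)
    return parser


# Parse the same style of arguments as the command above; the file names
# for the statistics, vocabulary, and model directory are placeholders.
args = make_parser().parse_args([
    "--use_gpu=1",
    "--trainer_count=4",
    "--batch_size=256",
    "--mean_std_filepath=./mean_std.npz",
    "--train_manifest_path=./local.train.manifest",
    "--dev_manifest_path=./local.test.manifest",
    "--vocab_filepath=./vocab.txt",
    "--output_model_dir=./model",
])
print(vars(args))
```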

## Step-4  Submit job
```
$ sh pcloud_submit.sh
```


## Step-5 Get logs
```
$ paddlecloud logs -n 10000 deepspeech20170727130129
```
For more information, please refer to the [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#下载并配置paddlecloud) documentation, or get help with the following command:
```
paddlecloud --help
```
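
The job name passed to `paddlecloud logs` in Step-5 looks like a fixed prefix followed by a timestamp. Assuming that convention (it is inferred from the example name only, and `make_job_name` is a hypothetical helper), a name of the same shape can be produced like this:

```python
from datetime import datetime


def make_job_name(prefix="deepspeech", now=None):
    """Build a job name as prefix + YYYYMMDDHHMMSS (assumed convention)."""
    now = now or datetime.now()
    return prefix + now.strftime("%Y%m%d%H%M%S")


# A fixed time is used so the result is reproducible.
name = make_job_name(now=datetime(2017, 7, 27, 13, 1, 29))
print(name)  # deepspeech20170727130129
```

Using a timestamp keeps job names unique across submissions, so the name printed at submission time is the one to pass to the logs command.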