E2E/Streaming Transformer/Conformer ASR (#578)

* add cmvn and label smoothing loss layer (label smoothing sketch below, after this list)

* add layer for transformer

* add glu and conformer conv

* add torch compatible hack, mask funcs

* don't hack size since it already exists

* add test; attention

* add attention, common utils, hack paddle

* add audio utils

* conformer batch padding mask bug fix #223

* fix typo; python infer: fix rnn mem opt name error and batchnorm1d, will be available in 2.0.2

* fix ci

* fix ci

* add encoder

* refactor egs

* add decoder

* refactor ctc, add ctc align, refactor ckpt, add warmup lr scheduler (sketch below), cmvn utils

* refactor docs

* add fix

* fix readme

* fix bugs, refactor collator, add pad_sequence, fix ckpt bugs

* fix docstring

* refactor data feed order

* add u2 model

* refactor cmvn, test

* add utils

* add u2 config

* fix bugs

* fix bugs

* fix potential autograd problem when using inplace operations

* refactor data, build vocab; add format data

* fix text featurizer

* refactor build vocab

* add fbank, refactor feature of speech

* refactor audio feat

* refactor data prepare

* refactor data

* model init from config

* add u2 bins

* flake8

* can train

* fix bugs, add coverage, add scripts

* test can run

* fix data

* speed perturb with sox

* add spec aug (sketch below)

* fix for train

* fix train logic

* fix logger

* log valid loss, time dataset process

* use np for speed perturb, remove some grad clip debug logs

* fix logger

* fix build vocab

* fix logger name

* using module logger as default

* fix

* fix install

* reorder imports

* fix board logger

* fix logger

* kaldi fbank and mfcc

* fix cmvn and print params

* fix add_eos_sos and cmvn

* fix cmvn compute

* fix logger and cmvn

* fix subsampling, label smoothing loss, remove useless

* add notebook test

* fix log

* fix tb logger

* multi gpu valid

* fix log

* fix log

* fix config

* fix compute cmvn, need paddle 2.1

* add cmvn notebook

* fix layer tools

* fix compute cmvn

* add rtf

* fix decoding

* fix layer tools

* fix log, add avg script

* more avg and test info

* fix dataset pickle problem; use paddle 2.1; num_workers can be > 0; save ckpt in exp dir; fix setup.sh

* add vimrc

* refactor tiny script, add transformer and stream conf

* spm demo; librispeech scripts and confs

* fix log

* add librispeech scripts

* refactor data pipe; fix conf; fix u2 default params

* fix bugs

* refactor aishell scripts

* fix test

* fix cmvn

* fix s0 scripts

* fix ds2 scripts and bugs

* fix dev & test dataset filter

* fix dataset filter

* filter dev

* fix ckpt path

* filter test set, since librispeech will cause OOM; but all test WER will be worse, due to train/test mismatch

* add comment

* add syllable doc

* fix ds2 configs

* add doc

* add pypinyin tools

* fix decoder using blank_id=0

* mmseg with pybind11

* format code
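A few of the components named in the commits above are standard enough to sketch. For the label smoothing loss layer: a minimal, hypothetical sketch in the ESPnet/WeNet style; the class and argument names are illustrative, not the PR's actual deepspeech.modules.loss API.

import paddle
import paddle.nn.functional as F
from paddle import nn

class LabelSmoothingLoss(nn.Layer):
    """KL loss against a label-smoothed target distribution (illustrative sketch)."""

    def __init__(self, vocab_size: int, padding_idx: int, smoothing: float = 0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing

    def forward(self, logits: paddle.Tensor, target: paddle.Tensor) -> paddle.Tensor:
        # logits: [B, T, V]; target: [B, T] with padding_idx on padded frames.
        V = logits.shape[-1]
        logits = logits.reshape([-1, V])
        target = target.reshape([-1])
        ignore = (target == self.padding_idx)
        safe_target = paddle.where(ignore, paddle.zeros_like(target), target)
        # Smoothed one-hot: `confidence` on the gold label, rest spread uniformly.
        one_hot = F.one_hot(safe_target, V)
        true_dist = one_hot * self.confidence + (1.0 - one_hot) * self.smoothing / (V - 1)
        kl = F.kl_div(F.log_softmax(logits, axis=-1), true_dist, reduction='none').sum(axis=-1)
        # Zero out padded positions, normalize by the number of real tokens.
        kl = paddle.where(ignore, paddle.zeros_like(kl), kl)
        return kl.sum() / paddle.logical_not(ignore).astype('float32').sum()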
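For the warmup lr scheduler: in this model family it is typically the Noam schedule from the Transformer paper. A sketch of the update rule, with illustrative defaults that need not match the PR's configs:

def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 25000,
            scale: float = 1.0) -> float:
    """Noam schedule: linear warmup for `warmup_steps`, then ~step**-0.5 decay."""
    step = max(step, 1)  # avoid 0**-0.5 on the first step
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)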
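And for spec aug: a minimal NumPy sketch of SpecAugment-style time/frequency masking. Mask counts and widths here are illustrative; the real values live in the repo's augmentation config.

import numpy as np

def spec_augment(spec: np.ndarray, n_t_masks: int = 2, n_f_masks: int = 2,
                 max_t: int = 40, max_f: int = 30) -> np.ndarray:
    """Zero out random time and frequency bands of a [time, freq] spectrogram (in place)."""
    t_len, f_len = spec.shape
    for _ in range(n_t_masks):
        t = np.random.randint(0, max_t + 1)
        t0 = np.random.randint(0, max(1, t_len - t))
        spec[t0:t0 + t, :] = 0.0
    for _ in range(n_f_masks):
        f = np.random.randint(0, max_f + 1)
        f0 = np.random.randint(0, max(1, f_len - f))
        spec[:, f0:f0 + f] = 0.0
    return spec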
Hui Zhang committed 71e046b0ba (parent 3a2de9e461) via GitHub

@@ -16,8 +16,8 @@
 ---
 Language: Cpp
 BasedOnStyle: Google
-IndentWidth: 2
-TabWidth: 2
+IndentWidth: 4
+TabWidth: 4
 ContinuationIndentWidth: 4
 MaxEmptyLinesToKeep: 2
 AccessModifierOffset: -2 # The private/protected/public has no indent in class

@@ -0,0 +1,50 @@
[flake8]
########## OPTIONS ##########
# Set the maximum length that any line (with some exceptions) may be.
max-line-length = 120
################### FILE PATTERNS ##########################
# Provide a comma-separated list of glob patterns to exclude from checks.
exclude =
# git folder
.git,
# python cache
__pycache__,
third_party/,
# Provide a comma-separated list of glob patterns to include for checks.
filename =
*.py
########## RULES ##########
# ERROR CODES
#
# E/W - PEP8 errors/warnings (pycodestyle)
# F - linting errors (pyflakes)
# C - McCabe complexity error (mccabe)
#
# W503 - line break before binary operator
# Specify a list of codes to ignore.
ignore =
W503
E252,E262,E127,E265,E126,E266,E241,E261,E128,E125
W291,W293,W605
E203,E305,E402,E501,E721,E741,F403,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
# these ignores are from flake8-bugbear; please fix!
B007,B008,
# these ignores are from flake8-comprehensions; please fix!
C400,C401,C402,C403,C404,C405,C407,C411,C413,C414,C415
# Specify the list of error codes you wish Flake8 to report.
select =
E,
W,
F,
C

@@ -0,0 +1,48 @@
[alias]
st = status
ci = commit
br = branch
co = checkout
df = diff
l = log --pretty=format:\"%h %ad | %s%d [%an]\" --graph --date=short
ll = log --stat
[merge]
tool = vimdiff
[core]
excludesfile = ~/.gitignore
editor = vim
[color]
branch = auto
diff = auto
status = auto
[color "branch"]
current = yellow reverse
local = yellow
remote = green
[color "diff"]
meta = yellow bold
frag = magenta bold
old = red bold
new = green bold
[color "status"]
added = yellow
changed = green
untracked = cyan
[push]
default = matching
[credential]
helper = store
[user]
name =
email =

5
.gitignore vendored

@@ -5,3 +5,8 @@ tools/venv
 *.log
 *.pdmodel
 *.pdiparams*
+*.zip
+*.tar
+*.tar.gz
+.ipynb_checkpoints
+*.npz

@@ -0,0 +1,605 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "academic-surname",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddle import nn"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fundamental-treasure",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspace/DeepSpeech-2.x/tools/venv-dev/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
}
],
"source": [
"L = nn.Linear(256, 2048)\n",
"L2 = nn.Linear(2048, 256)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "consolidated-elephant",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import torch\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "moderate-noise",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float64\n",
"Tensor(shape=[2, 51, 256], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[[-1.54171216, -2.61531472, -1.79881978, ..., -0.31395876, 0.56513089, -0.44516513],\n",
" [-0.79492962, 1.91157901, 0.66567147, ..., 0.54825783, -1.01471853, -0.84924090],\n",
" [-1.22556651, -0.36225814, 0.65063190, ..., 0.65726501, 0.05563191, 0.09009409],\n",
" ...,\n",
" [ 0.38615900, -0.77905393, 0.99732304, ..., -1.38463700, -3.32365036, -1.31089687],\n",
" [ 0.05579993, 0.06885809, -1.66662002, ..., -0.23346378, -3.29372883, 1.30561364],\n",
" [ 1.90676069, 1.95093191, -0.28849599, ..., -0.06860496, 0.95347673, 1.00475824]],\n",
"\n",
" [[-0.91453546, 0.55298805, -1.06146812, ..., -0.86378336, 1.00454640, 1.26062179],\n",
" [ 0.10223761, 0.81301165, 2.36865163, ..., 0.16821407, 0.29240361, 1.05408621],\n",
" [-1.33196676, 1.94433689, 0.01934209, ..., 0.48036841, 0.51585966, 1.22893548],\n",
" ...,\n",
" [-0.19558455, -0.47075930, 0.90796155, ..., -1.28598249, -0.24321797, 0.17734711],\n",
" [ 0.89819717, -1.39516675, 0.17138045, ..., 2.39761519, 1.76364994, -0.52177650],\n",
" [ 0.94122332, -0.18581429, 1.36099780, ..., 0.67647684, -0.04699665, 1.51205540]]])\n",
"tensor([[[-1.5417, -2.6153, -1.7988, ..., -0.3140, 0.5651, -0.4452],\n",
" [-0.7949, 1.9116, 0.6657, ..., 0.5483, -1.0147, -0.8492],\n",
" [-1.2256, -0.3623, 0.6506, ..., 0.6573, 0.0556, 0.0901],\n",
" ...,\n",
" [ 0.3862, -0.7791, 0.9973, ..., -1.3846, -3.3237, -1.3109],\n",
" [ 0.0558, 0.0689, -1.6666, ..., -0.2335, -3.2937, 1.3056],\n",
" [ 1.9068, 1.9509, -0.2885, ..., -0.0686, 0.9535, 1.0048]],\n",
"\n",
" [[-0.9145, 0.5530, -1.0615, ..., -0.8638, 1.0045, 1.2606],\n",
" [ 0.1022, 0.8130, 2.3687, ..., 0.1682, 0.2924, 1.0541],\n",
" [-1.3320, 1.9443, 0.0193, ..., 0.4804, 0.5159, 1.2289],\n",
" ...,\n",
" [-0.1956, -0.4708, 0.9080, ..., -1.2860, -0.2432, 0.1773],\n",
" [ 0.8982, -1.3952, 0.1714, ..., 2.3976, 1.7636, -0.5218],\n",
" [ 0.9412, -0.1858, 1.3610, ..., 0.6765, -0.0470, 1.5121]]])\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspace/DeepSpeech-2.x/tools/venv-dev/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
}
],
"source": [
"x = np.random.randn(2, 51, 256)\n",
"print(x.dtype)\n",
"px = paddle.to_tensor(x, dtype='float32')\n",
"tx = torch.tensor(x, dtype=torch.float32)\n",
"print(px)\n",
"print(tx)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cooked-progressive",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 5,
"id": "mechanical-prisoner",
"metadata": {},
"outputs": [],
"source": [
"data = np.load('enc_0_ff_out.npz', allow_pickle=True)\n",
"t_norm_ff = data['norm_ff']\n",
"t_ff_out = data['ff_out']\n",
"t_ff_l_x = data['ff_l_x']\n",
"t_ff_l_a_x = data['ff_l_a_x']\n",
"t_ff_l_a_l_x = data['ff_l_a_l_x']\n",
"t_ps = data['ps']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "indie-marriage",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 6,
"id": "assured-zambia",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"True\n",
"True\n"
]
}
],
"source": [
"L.set_state_dict({'weight': t_ps[0].T, 'bias': t_ps[1]})\n",
"L2.set_state_dict({'weight': t_ps[2].T, 'bias': t_ps[3]})\n",
"\n",
"ps = []\n",
"for n, p in L.named_parameters():\n",
" ps.append(p)\n",
"\n",
"for n, p in L2.state_dict().items():\n",
" ps.append(p)\n",
" \n",
"for p, tp in zip(ps, t_ps):\n",
" print(np.allclose(p.numpy(), tp.T))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "committed-jacob",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "extreme-traffic",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "optimum-milwaukee",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 7,
"id": "viral-indian",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"True\n",
"True\n"
]
}
],
"source": [
"# data = np.load('enc_0_ff_out.npz', allow_pickle=True)\n",
"# t_norm_ff = data['norm_ff']\n",
"# t_ff_out = data['ff_out']\n",
"# t_ff_l_x = data['ff_l_x']\n",
"# t_ff_l_a_x = data['ff_l_a_x']\n",
"# t_ff_l_a_l_x = data['ff_l_a_l_x']\n",
"# t_ps = data['ps']\n",
"TL = torch.nn.Linear(256, 2048)\n",
"TL2 = torch.nn.Linear(2048, 256)\n",
"TL.load_state_dict({'weight': torch.tensor(t_ps[0]), 'bias': torch.tensor(t_ps[1])})\n",
"TL2.load_state_dict({'weight': torch.tensor(t_ps[2]), 'bias': torch.tensor(t_ps[3])})\n",
"\n",
"# for n, p in TL.named_parameters():\n",
"# print(n, p)\n",
"# for n, p in TL2.named_parameters():\n",
"# print(n, p)\n",
"\n",
"ps = []\n",
"for n, p in TL.state_dict().items():\n",
" ps.append(p.data.numpy())\n",
" \n",
"for n, p in TL2.state_dict().items():\n",
" ps.append(p.data.numpy())\n",
" \n",
"for p, tp in zip(ps, t_ps):\n",
" print(np.allclose(p, tp))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "skilled-vietnamese",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[[ 0.67277956 0.08313607 -0.62761104 ... -0.17480263 0.42718208\n",
" -0.5787626 ]\n",
" [ 0.91516656 0.5393416 1.7159258 ... 0.06144593 0.06486575\n",
" -0.03350811]\n",
" [ 0.438351 0.6227843 0.24096036 ... 1.0912522 -0.90929437\n",
" -1.012989 ]\n",
" ...\n",
" [ 0.68631977 0.14240924 0.10763275 ... -0.11513516 0.48065388\n",
" 0.04070369]\n",
" [-0.9525228 0.23197874 0.31264272 ... 0.5312439 0.18773697\n",
" -0.8450228 ]\n",
" [ 0.42024016 -0.04561988 0.54541194 ... -0.41933843 -0.00436018\n",
" -0.06663495]]\n",
"\n",
" [[-0.11638781 -0.33566502 -0.20887226 ... 0.17423287 -0.9195841\n",
" -0.8161046 ]\n",
" [-0.3469874 0.88269687 -0.11887559 ... -0.15566081 0.16357468\n",
" -0.20766167]\n",
" [-0.3847657 0.3984318 -0.06963477 ... -0.00360622 1.2360432\n",
" -0.26811332]\n",
" ...\n",
" [ 0.08230796 -0.46158582 0.54582864 ... 0.15747628 -0.44790155\n",
" 0.06020184]\n",
" [-0.8095085 0.43163058 -0.42837143 ... 0.8627463 0.90656304\n",
" 0.15847842]\n",
" [-1.485811 -0.18216592 -0.8882585 ... 0.32596245 0.7822631\n",
" -0.6460344 ]]]\n",
"[[[ 0.67278004 0.08313602 -0.6276114 ... -0.17480245 0.42718196\n",
" -0.5787625 ]\n",
" [ 0.91516703 0.5393413 1.7159253 ... 0.06144581 0.06486579\n",
" -0.03350812]\n",
" [ 0.43835106 0.62278455 0.24096027 ... 1.0912521 -0.9092943\n",
" -1.0129892 ]\n",
" ...\n",
" [ 0.6863195 0.14240888 0.10763284 ... -0.11513527 0.48065376\n",
" 0.04070365]\n",
" [-0.9525231 0.23197863 0.31264275 ... 0.53124386 0.18773702\n",
" -0.84502304]\n",
" [ 0.42024007 -0.04561983 0.545412 ... -0.41933888 -0.00436005\n",
" -0.066635 ]]\n",
"\n",
" [[-0.11638767 -0.33566508 -0.20887226 ... 0.17423296 -0.9195838\n",
" -0.8161046 ]\n",
" [-0.34698725 0.88269705 -0.11887549 ... -0.15566081 0.16357464\n",
" -0.20766166]\n",
" [-0.3847657 0.3984319 -0.06963488 ... -0.00360619 1.2360426\n",
" -0.26811326]\n",
" ...\n",
" [ 0.08230786 -0.4615857 0.5458287 ... 0.15747619 -0.44790167\n",
" 0.06020182]\n",
" [-0.8095083 0.4316307 -0.42837155 ... 0.862746 0.9065631\n",
" 0.15847899]\n",
" [-1.485811 -0.18216613 -0.8882584 ... 0.32596254 0.7822631\n",
" -0.6460344 ]]]\n",
"True\n",
"False\n"
]
}
],
"source": [
"y = L(px)\n",
"print(y.numpy())\n",
"\n",
"ty = TL(tx)\n",
"print(ty.data.numpy())\n",
"print(np.allclose(px.numpy(), tx.detach().numpy()))\n",
"print(np.allclose(y.numpy(), ty.detach().numpy()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "incorrect-allah",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "prostate-cameroon",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 9,
"id": "governmental-surge",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.04476918 0.554463 -0.3027508 ... -0.49600336 0.3751858\n",
" 0.8254095 ]\n",
" [ 0.95594174 -0.29528382 -1.2899452 ... 0.43718258 0.05584608\n",
" -0.06974669]]\n",
"[[ 0.04476918 0.5544631 -0.3027507 ... -0.49600336 0.37518573\n",
" 0.8254096 ]\n",
" [ 0.95594174 -0.29528376 -1.2899454 ... 0.4371827 0.05584623\n",
" -0.0697467 ]]\n",
"True\n",
"False\n",
"True\n"
]
}
],
"source": [
"x = np.random.randn(2, 256)\n",
"px = paddle.to_tensor(x, dtype='float32')\n",
"tx = torch.tensor(x, dtype=torch.float32)\n",
"y = L(px)\n",
"print(y.numpy())\n",
"ty = TL(tx)\n",
"print(ty.data.numpy())\n",
"print(np.allclose(px.numpy(), tx.detach().numpy()))\n",
"print(np.allclose(y.numpy(), ty.detach().numpy()))\n",
"print(np.allclose(y.numpy(), ty.detach().numpy(), atol=1e-5))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "confidential-jacket",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 10,
"id": "improved-civilization",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5e7e7c9fde8350084abf1898cf52651cfc84b17a\n"
]
}
],
"source": [
"print(paddle.version.commit)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d1e2d3b4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['__builtins__',\n",
" '__cached__',\n",
" '__doc__',\n",
" '__file__',\n",
" '__loader__',\n",
" '__name__',\n",
" '__package__',\n",
" '__spec__',\n",
" 'commit',\n",
" 'full_version',\n",
" 'istaged',\n",
" 'major',\n",
" 'minor',\n",
" 'mkl',\n",
" 'patch',\n",
" 'rc',\n",
" 'show',\n",
" 'with_mkl']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dir(paddle.version)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "c880c719",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.1.0\n"
]
}
],
"source": [
"print(paddle.version.full_version)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f26977bf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"commit: 5e7e7c9fde8350084abf1898cf52651cfc84b17a\n",
"None\n"
]
}
],
"source": [
"print(paddle.version.show())"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "04ad47f6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.6.0\n"
]
}
],
"source": [
"print(torch.__version__)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e1e03830",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['__builtins__',\n",
" '__cached__',\n",
" '__doc__',\n",
" '__file__',\n",
" '__loader__',\n",
" '__name__',\n",
" '__package__',\n",
" '__spec__',\n",
" '__version__',\n",
" 'cuda',\n",
" 'debug',\n",
" 'git_version',\n",
" 'hip']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dir(torch.version)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "4ad0389b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'b31f58de6fa8bbda5353b3c77d9be4914399724d'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.version.git_version"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "7870ea10",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'10.2'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.version.cuda"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db8ee5a7",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6321ec2a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
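The notebook above verifies two things worth calling out: paddle.nn.Linear stores its weight as [in_features, out_features] while torch.nn.Linear uses [out_features, in_features] (hence the `.T` when loading), and fp32 outputs of the two frameworks agree only to roughly 1e-5, so np.allclose needs an explicit atol. A small helper capturing the porting convention (hypothetical name, following the notebook's set_state_dict usage):

import paddle
import torch

def port_linear(t_linear: torch.nn.Linear) -> paddle.nn.Linear:
    """Copy a torch Linear into an equivalent paddle Linear, transposing the weight."""
    out_f, in_f = t_linear.weight.shape
    p_linear = paddle.nn.Linear(in_f, out_f)
    # set_state_dict accepts numpy arrays, as the notebook above demonstrates.
    p_linear.set_state_dict({
        'weight': t_linear.weight.detach().numpy().T,  # [out, in] -> [in, out]
        'bias': t_linear.bias.detach().numpy(),
    })
    return p_linear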

File diff suppressed because one or more lines are too long

@@ -338,7 +338,7 @@
 }
 ],
 "source": [
-"for idx, (audio, text, audio_len, text_len) in enumerate(batch_reader()):\n",
+"for idx, (audio, audio_len, text, text_len) in enumerate(batch_reader()):\n",
 " print('test', text)\n",
 " print(\"test raw\", ''.join( chr(i) for i in text[0][:int(text_len[0])] ))\n",
 " print(\"test raw\", ''.join( chr(i) for i in text[-1][:int(text_len[-1])] ))\n",
@@ -386,4 +386,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}

File diff suppressed because it is too large

@@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "breeding-haven",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x\n"
]
},
{
"data": {
"text/plain": [
"'/home/ssd5/zhanghui/DeepSpeech2.x'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%cd ..\n",
"%pwd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "appropriate-theta",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LICENSE deepspeech examples\t\t requirements.txt tools\r\n",
"README.md docs\t libsndfile-1.0.28\t setup.sh\t utils\r\n",
"README_cn.md env.sh\t libsndfile-1.0.28.tar.gz tests\r\n"
]
}
],
"source": [
"!ls"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "entire-bloom",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
" def convert_to_list(value, n, name, dtype=np.int):\n",
"WARNING:root:override cat of paddle.Tensor if exists or register, remove this when fixed!\n",
"WARNING:root:register user masked_fill to paddle.Tensor, remove this when fixed!\n",
"WARNING:root:register user masked_fill_ to paddle.Tensor, remove this when fixed!\n",
"WARNING:root:register user repeat to paddle.Tensor, remove this when fixed!\n",
"WARNING:root:register user glu to paddle.nn.functional, remove this when fixed!\n",
"WARNING:root:register user GLU to paddle.nn, remove this when fixed!\n",
"WARNING:root:register user ConstantPad2d to paddle.nn, remove this when fixed!\n",
"WARNING:root:override ctc_loss of paddle.nn.functional if exists, remove this when fixed!\n"
]
}
],
"source": [
"from deepspeech.modules import loss"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "governmental-aircraft",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
}
],
"source": [
"import paddle"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "proprietary-disaster",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<function deepspeech.modules.repeat(xs: paddle.VarBase, *size: Any) -> paddle.VarBase>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paddle.Tensor.repeat"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "first-diagram",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<property at 0x7fb515eeeb88>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paddle.Tensor.size"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "intelligent-david",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<function paddle.tensor.manipulation.concat(x, axis=0, name=None)>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paddle.Tensor.cat"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "bronze-tenant",
"metadata": {},
"outputs": [],
"source": [
"a = paddle.to_tensor([12,32, 10, 12, 123,32 ,4])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "balanced-bearing",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.size"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "extreme-republic",
"metadata": {},
"outputs": [],
"source": [
"def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor:\n",
" nargs = len(args)\n",
" assert (nargs <= 1)\n",
" s = paddle.shape(xs)\n",
" if nargs == 1:\n",
" return s[args[0]]\n",
" else:\n",
" return s\n",
"\n",
"# logger.warn(\n",
"# \"override size of paddle.Tensor if exists or register, remove this when fixed!\"\n",
"# )\n",
"paddle.Tensor.size = size"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "gross-addiction",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [7])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.size(0)\n",
"a.size()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "adverse-dining",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [7])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "popular-potato",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -0,0 +1,672 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x\n"
]
},
{
"data": {
"text/plain": [
"'/home/ssd5/zhanghui/DeepSpeech2.x'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%cd ..\n",
"%pwd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2021-03-26 02:55:23,873 - WARNING - register user softmax to paddle, remove this when fixed!\n",
"2021-03-26 02:55:23,875 - WARNING - register user sigmoid to paddle, remove this when fixed!\n",
"2021-03-26 02:55:23,875 - WARNING - register user relu to paddle, remove this when fixed!\n",
"2021-03-26 02:55:23,876 - WARNING - override cat of paddle if exists or register, remove this when fixed!\n",
"2021-03-26 02:55:23,876 - WARNING - override eq of paddle.Tensor if exists or register, remove this when fixed!\n",
"2021-03-26 02:55:23,877 - WARNING - override contiguous of paddle.Tensor if exists or register, remove this when fixed!\n",
"2021-03-26 02:55:23,877 - WARNING - override size of paddle.Tensor (`to_static` do not process `size` property, maybe some `paddle` api dependent on it), remove this when fixed!\n",
"2021-03-26 02:55:23,878 - WARNING - register user view to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,878 - WARNING - register user view_as to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,879 - WARNING - register user masked_fill to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,880 - WARNING - register user masked_fill_ to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,880 - WARNING - register user fill_ to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,881 - WARNING - register user repeat to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,881 - WARNING - register user softmax to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,882 - WARNING - register user sigmoid to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,882 - WARNING - register user relu to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,883 - WARNING - register user glu to paddle.nn.functional, remove this when fixed!\n",
"2021-03-26 02:55:23,883 - WARNING - override ctc_loss of paddle.nn.functional if exists, remove this when fixed!\n",
"2021-03-26 02:55:23,884 - WARNING - register user GLU to paddle.nn, remove this when fixed!\n",
"2021-03-26 02:55:23,884 - WARNING - register user ConstantPad2d to paddle.nn, remove this when fixed!\n",
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv-dev/lib/python3.7/site-packages/scipy/fftpack/__init__.py:103: DeprecationWarning: The module numpy.dual is deprecated. Instead of using dual, use the functions directly from numpy or scipy.\n",
" from numpy.dual import register_func\n",
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv-dev/lib/python3.7/site-packages/scipy/special/orthogonal.py:81: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
" from numpy import (exp, inf, pi, sqrt, floor, sin, cos, around, int,\n"
]
}
],
"source": [
"import os\n",
"import time\n",
"import argparse\n",
"import functools\n",
"import paddle\n",
"import numpy as np\n",
"\n",
"from deepspeech.utils.socket_server import warm_up_test\n",
"from deepspeech.utils.socket_server import AsrTCPServer\n",
"from deepspeech.utils.socket_server import AsrRequestHandler\n",
"\n",
"from deepspeech.training.cli import default_argument_parser\n",
"from deepspeech.exps.deepspeech2.config import get_cfg_defaults\n",
"\n",
"from deepspeech.frontend.utility import read_manifest\n",
"from deepspeech.utils.utility import add_arguments, print_arguments\n",
"\n",
"from deepspeech.models.deepspeech2 import DeepSpeech2Model\n",
"from deepspeech.models.deepspeech2 import DeepSpeech2InferModel\n",
"from deepspeech.io.dataset import ManifestDataset\n",
"\n",
"\n",
"\n",
"from deepspeech.frontend.utility import read_manifest"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0.0\n",
"e7f28d6c0db54eb9c9a810612300b526687e56a6\n",
"OFF\n",
"OFF\n",
"commit: e7f28d6c0db54eb9c9a810612300b526687e56a6\n",
"None\n",
"0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv-dev/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
},
{
"data": {
"text/plain": [
"['__builtins__',\n",
" '__cached__',\n",
" '__doc__',\n",
" '__file__',\n",
" '__loader__',\n",
" '__name__',\n",
" '__package__',\n",
" '__spec__',\n",
" 'commit',\n",
" 'full_version',\n",
" 'istaged',\n",
" 'major',\n",
" 'minor',\n",
" 'mkl',\n",
" 'patch',\n",
" 'rc',\n",
" 'show',\n",
" 'with_mkl']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(paddle.__version__)\n",
"print(paddle.version.commit)\n",
"print(paddle.version.with_mkl)\n",
"print(paddle.version.mkl())\n",
"print(paddle.version.show())\n",
"print(paddle.version.patch)\n",
"dir(paddle.version)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"data:\n",
" augmentation_config: conf/augmentation.config\n",
" batch_size: 64\n",
" dev_manifest: data/manifest.dev\n",
" keep_transcription_text: False\n",
" max_duration: 27.0\n",
" max_freq: None\n",
" mean_std_filepath: examples/aishell/data/mean_std.npz\n",
" min_duration: 0.0\n",
" n_fft: None\n",
" num_workers: 0\n",
" random_seed: 0\n",
" shuffle_method: batch_shuffle\n",
" sortagrad: True\n",
" specgram_type: linear\n",
" stride_ms: 10.0\n",
" target_dB: -20\n",
" target_sample_rate: 16000\n",
" test_manifest: examples/aishell/data/manifest.test\n",
" train_manifest: data/manifest.train\n",
" use_dB_normalization: True\n",
" vocab_filepath: examples/aishell/data/vocab.txt\n",
" window_ms: 20.0\n",
"decoding:\n",
" alpha: 2.6\n",
" batch_size: 128\n",
" beam_size: 300\n",
" beta: 5.0\n",
" cutoff_prob: 0.99\n",
" cutoff_top_n: 40\n",
" decoding_method: ctc_beam_search\n",
" error_rate_type: cer\n",
" lang_model_path: data/lm/zh_giga.no_cna_cmn.prune01244.klm\n",
" num_proc_bsearch: 10\n",
"model:\n",
" num_conv_layers: 2\n",
" num_rnn_layers: 3\n",
" rnn_layer_size: 1024\n",
" share_rnn_weights: False\n",
" use_gru: True\n",
"training:\n",
" global_grad_clip: 5.0\n",
" lr: 0.0005\n",
" lr_decay: 0.83\n",
" n_epoch: 30\n",
" weight_decay: 1e-06\n",
"----------- Configuration Arguments -----------\n",
"checkpoint_path: examples/aishell/ckpt-loss2e-3-0.83-5/checkpoints/step-11725\n",
"config: examples/aishell/conf/deepspeech2.yaml\n",
"device: gpu\n",
"dump_config: None\n",
"export_path: None\n",
"host_ip: localhost\n",
"host_port: 8086\n",
"model_dir: None\n",
"model_file: examples/aishell/jit.model.pdmodel\n",
"nprocs: 1\n",
"opts: ['data.test_manifest', 'examples/aishell/data/manifest.test', 'data.mean_std_filepath', 'examples/aishell/data/mean_std.npz', 'data.vocab_filepath', 'examples/aishell/data/vocab.txt']\n",
"output: None\n",
"params_file: examples/aishell/jit.model.pdiparams\n",
"speech_save_dir: demo_cache\n",
"use_gpu: False\n",
"warmup_manifest: examples/aishell/data/manifest.test\n",
"------------------------------------------------\n"
]
}
],
"source": [
"parser = default_argument_parser()\n",
"add_arg = functools.partial(add_arguments, argparser=parser)\n",
"add_arg('host_ip', str,\n",
" 'localhost',\n",
" \"Server's IP address.\")\n",
"add_arg('host_port', int, 8086, \"Server's IP port.\")\n",
"add_arg('speech_save_dir', str,\n",
" 'demo_cache',\n",
" \"Directory to save demo audios.\")\n",
"add_arg('warmup_manifest', \n",
" str, \n",
" \"examples/aishell/data/manifest.test\", \n",
" \"Filepath of manifest to warm up.\")\n",
"add_arg(\n",
" \"--model_file\",\n",
" type=str,\n",
" default=\"examples/aishell/jit.model.pdmodel\",\n",
" help=\"Model filename, Specify this when your model is a combined model.\"\n",
")\n",
"add_arg(\n",
" \"--params_file\",\n",
" type=str,\n",
" default=\"examples/aishell/jit.model.pdiparams\",\n",
" help=\n",
" \"Parameter filename, Specify this when your model is a combined model.\"\n",
")\n",
"add_arg(\n",
" \"--model_dir\",\n",
" type=str,\n",
" default=None,\n",
" help=\n",
" \"Model dir, If you load a non-combined model, specify the directory of the model.\"\n",
")\n",
"add_arg(\"--use_gpu\",type=bool,default=False, help=\"Whether use gpu.\")\n",
"\n",
"\n",
"args = parser.parse_args(\n",
" \"--checkpoint_path examples/aishell/ckpt-loss2e-3-0.83-5/checkpoints/step-11725 --config examples/aishell/conf/deepspeech2.yaml --opts data.test_manifest examples/aishell/data/manifest.test data.mean_std_filepath examples/aishell/data/mean_std.npz data.vocab_filepath examples/aishell/data/vocab.txt\".split()\n",
")\n",
"\n",
"\n",
"config = get_cfg_defaults()\n",
"if args.config:\n",
" config.merge_from_file(args.config)\n",
"if args.opts:\n",
" config.merge_from_list(args.opts)\n",
"config.freeze()\n",
"print(config)\n",
"\n",
"args.warmup_manifest = config.data.test_manifest\n",
"\n",
"print_arguments(args)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dataset = ManifestDataset(\n",
" config.data.test_manifest,\n",
" config.data.unit_type,\n",
" config.data.vocab_filepath,\n",
" config.data.mean_std_filepath,\n",
" augmentation_config=\"{}\",\n",
" max_duration=config.data.max_duration,\n",
" min_duration=config.data.min_duration,\n",
" stride_ms=config.data.stride_ms,\n",
" window_ms=config.data.window_ms,\n",
" n_fft=config.data.n_fft,\n",
" max_freq=config.data.max_freq,\n",
" target_sample_rate=config.data.target_sample_rate,\n",
" specgram_type=config.data.specgram_type,\n",
" feat_dim=config.data.feat_dim,\n",
" delta_delta=config.data.delat_delta,\n",
" use_dB_normalization=config.data.use_dB_normalization,\n",
" target_dB=config.data.target_dB,\n",
" random_seed=config.data.random_seed,\n",
" keep_transcription_text=True)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2021-03-26 02:55:57,930 - INFO - [checkpoint] Rank 0: loaded model from examples/aishell/ckpt-loss2e-3-0.83-5/checkpoints/step-11725.pdparams\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"layer summary:\n",
"encoder.conv.conv_in.conv.weight|[32, 1, 41, 11]|14432\n",
"encoder.conv.conv_in.bn.weight|[32]|32\n",
"encoder.conv.conv_in.bn.bias|[32]|32\n",
"encoder.conv.conv_in.bn._mean|[32]|32\n",
"encoder.conv.conv_in.bn._variance|[32]|32\n",
"encoder.conv.conv_stack.0.conv.weight|[32, 32, 21, 11]|236544\n",
"encoder.conv.conv_stack.0.bn.weight|[32]|32\n",
"encoder.conv.conv_stack.0.bn.bias|[32]|32\n",
"encoder.conv.conv_stack.0.bn._mean|[32]|32\n",
"encoder.conv.conv_stack.0.bn._variance|[32]|32\n",
"encoder.rnn.rnn_stacks.0.fw_fc.weight|[1312, 3072]|4030464\n",
"encoder.rnn.rnn_stacks.0.fw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_fc.weight|[1312, 3072]|4030464\n",
"encoder.rnn.rnn_stacks.0.bw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.fw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.bw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.fw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.bw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.1.fw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.1.bw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.fw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.bw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.fw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.bw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.2.fw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.2.bw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.fw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.bw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.fw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.bw_rnn.cell.bias_hh|[3072]|3072\n",
"decoder.ctc_lo.weight|[2048, 4300]|8806400\n",
"decoder.ctc_lo.bias|[4300]|4300\n",
"layer has 66 parameters, 80148012 elements.\n"
]
}
],
"source": [
"model = DeepSpeech2InferModel.from_pretrained(dataset, config,\n",
" args.checkpoint_path)\n",
"model.eval()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"examples/aishell/jit.model.pdmodel\n",
"examples/aishell/jit.model.pdiparams\n",
"0\n",
"False\n"
]
}
],
"source": [
"\n",
"from paddle.inference import Config\n",
"from paddle.inference import PrecisionType\n",
"from paddle.inference import create_predictor\n",
"\n",
"args.use_gpu=False\n",
"paddle.set_device('cpu')\n",
"\n",
"def init_predictor(args):\n",
" if args.model_dir is not None:\n",
" config = Config(args.model_dir)\n",
" else:\n",
" config = Config(args.model_file, args.params_file)\n",
"\n",
" if args.use_gpu:\n",
" config.enable_use_gpu(memory_pool_init_size_mb=1000, device_id=0)\n",
"# config.enable_tensorrt_engine(precision_mode=PrecisionType.Float32,\n",
"# use_calib_mode=True) # 开启TensorRT预测精度为fp32开启int8离线量化\n",
" else:\n",
" # If not specific mkldnn, you can set the blas thread.\n",
" # The thread num should not be greater than the number of cores in the CPU.\n",
" config.set_cpu_math_library_num_threads(1)\n",
" config.enable_mkldnn()\n",
" \n",
" config.enable_memory_optim()\n",
" config.switch_ir_optim(True)\n",
" \n",
" print(config.model_dir())\n",
" print(config.prog_file())\n",
" print(config.params_file())\n",
" print(config.gpu_device_id())\n",
" print(args.use_gpu)\n",
" predictor = create_predictor(config)\n",
" return predictor\n",
"\n",
"def run(predictor, audio, audio_len):\n",
" # copy img data to input tensor\n",
" input_names = predictor.get_input_names()\n",
" for i, name in enumerate(input_names):\n",
" print(\"input:\", i, name)\n",
" \n",
" audio_tensor = predictor.get_input_handle('audio')\n",
" audio_tensor.reshape(audio.shape)\n",
" audio_tensor.copy_from_cpu(audio.copy())\n",
" \n",
" audiolen_tensor = predictor.get_input_handle('audio_len')\n",
" audiolen_tensor.reshape(audio_len.shape)\n",
" audiolen_tensor.copy_from_cpu(audio_len.copy())\n",
"\n",
" output_names = predictor.get_output_names()\n",
" for i, name in enumerate(output_names):\n",
" print(\"output:\", i, name)\n",
"\n",
" # do the inference\n",
" predictor.run()\n",
"\n",
" results = []\n",
" # get out data from output tensor\n",
" output_names = predictor.get_output_names()\n",
" for i, name in enumerate(output_names):\n",
" output_tensor = predictor.get_output_handle(name)\n",
" output_data = output_tensor.copy_to_cpu()\n",
" results.append(output_data)\n",
"\n",
" return results\n",
"\n",
"\n",
"predictor = init_predictor(args)\n",
"\n",
"def file_to_transcript(filename):\n",
" print(filename)\n",
" feature = dataset.process_utterance(filename, \"\")\n",
" audio = np.array([feature[0]]).astype('float32') #[1, D, T]\n",
" audio_len = feature[0].shape[1]\n",
" audio_len = np.array([audio_len]).astype('int64') # [1]\n",
" \n",
" \n",
" i_probs = run(predictor, audio, audio_len)\n",
" print('jit:', i_probs[0], type(i_probs[0]))\n",
" \n",
" audio = paddle.to_tensor(audio)\n",
" audio_len = paddle.to_tensor(audio_len)\n",
" print(audio.shape)\n",
" print(audio_len.shape)\n",
" \n",
" #eouts, eouts_len = model.encoder(audio, audio_len)\n",
" #probs = model.decoder.softmax(eouts)\n",
" probs = model.forward(audio, audio_len)\n",
" print('paddle:', probs.numpy())\n",
" \n",
" flag = np.allclose(i_probs[0], probs.numpy())\n",
" print(flag)\n",
" \n",
" return probs\n",
"\n",
"# result_transcript = model.decode(\n",
"# audio,\n",
"# audio_len,\n",
"# vocab_list=dataset.vocab_list,\n",
"# decoding_method=config.decoding.decoding_method,\n",
"# lang_model_path=config.decoding.lang_model_path,\n",
"# beam_alpha=config.decoding.alpha,\n",
"# beam_beta=config.decoding.beta,\n",
"# beam_size=config.decoding.beam_size,\n",
"# cutoff_prob=config.decoding.cutoff_prob,\n",
"# cutoff_top_n=config.decoding.cutoff_top_n,\n",
"# num_processes=config.decoding.num_proc_bsearch)\n",
"# return result_transcript[0]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warm-up Test Case %d: %s 0 /home/ssd5/zhanghui/DeepSpeech2.x/examples/aishell/../dataset/aishell/data_aishell/wav/test/S0764/BAC009S0764W0124.wav\n",
"/home/ssd5/zhanghui/DeepSpeech2.x/examples/aishell/../dataset/aishell/data_aishell/wav/test/S0764/BAC009S0764W0124.wav\n",
"input: 0 audio\n",
"input: 1 audio_len\n",
"output: 0 tmp_75\n",
"jit: [[[8.91786298e-12 4.45648032e-12 3.67572750e-09 ... 8.91767563e-12\n",
" 8.91573707e-12 4.64317296e-08]\n",
" [1.55950222e-15 2.62794089e-14 4.50423509e-12 ... 1.55944271e-15\n",
" 1.55891342e-15 9.99992609e-01]\n",
" [1.24638127e-17 7.61802427e-16 2.93265812e-14 ... 1.24633371e-17\n",
" 1.24587264e-17 1.00000000e+00]\n",
" ...\n",
" [4.37488240e-15 2.43676260e-12 1.98770514e-12 ... 4.37479896e-15\n",
" 4.37354747e-15 1.00000000e+00]\n",
" [3.89334696e-13 1.66754856e-11 1.42900388e-11 ... 3.89329492e-13\n",
" 3.89252270e-13 1.00000000e+00]\n",
" [1.00349985e-10 2.56293708e-10 2.91177582e-10 ... 1.00347876e-10\n",
" 1.00334095e-10 9.99998808e-01]]] <class 'numpy.ndarray'>\n",
"[1, 161, 522]\n",
"[1]\n",
"paddle: [[[8.91789680e-12 4.45649724e-12 3.67574149e-09 ... 8.91770945e-12\n",
" 8.91577090e-12 4.64319072e-08]\n",
" [1.55950222e-15 2.62794089e-14 4.50423509e-12 ... 1.55944271e-15\n",
" 1.55891342e-15 9.99992609e-01]\n",
" [1.24638599e-17 7.61805339e-16 2.93267472e-14 ... 1.24633842e-17\n",
" 1.24587735e-17 1.00000000e+00]\n",
" ...\n",
" [4.37488240e-15 2.43676737e-12 1.98770514e-12 ... 4.37479896e-15\n",
" 4.37354747e-15 1.00000000e+00]\n",
" [3.89336187e-13 1.66755481e-11 1.42900925e-11 ... 3.89330983e-13\n",
" 3.89253761e-13 1.00000000e+00]\n",
" [1.00349985e-10 2.56293708e-10 2.91177582e-10 ... 1.00347876e-10\n",
" 1.00334095e-10 9.99998808e-01]]]\n",
"False\n"
]
}
],
"source": [
"manifest = read_manifest(args.warmup_manifest)\n",
"\n",
"for idx, sample in enumerate(manifest[:1]):\n",
" print(\"Warm-up Test Case %d: %s\", idx, sample['audio_filepath'])\n",
" start_time = time.time()\n",
" transcript = file_to_transcript(sample['audio_filepath'])\n",
" finish_time = time.time()\n",
"# print(\"Response Time: %f, Transcript: %s\" %\n",
"# (finish_time - start_time, transcript))\n",
" break"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 161, 522) (1,)\n",
"input: 0 audio\n",
"input: 1 audio_len\n",
"output: 0 tmp_75\n",
"jit: [[[8.91789680e-12 4.45649724e-12 3.67574149e-09 ... 8.91770945e-12\n",
" 8.91577090e-12 4.64319072e-08]\n",
" [1.55950222e-15 2.62794089e-14 4.50423509e-12 ... 1.55944271e-15\n",
" 1.55891342e-15 9.99992609e-01]\n",
" [1.24638599e-17 7.61805339e-16 2.93267472e-14 ... 1.24633842e-17\n",
" 1.24587735e-17 1.00000000e+00]\n",
" ...\n",
" [4.37488240e-15 2.43676737e-12 1.98770514e-12 ... 4.37479896e-15\n",
" 4.37354747e-15 1.00000000e+00]\n",
" [3.89336187e-13 1.66755481e-11 1.42900925e-11 ... 3.89330983e-13\n",
" 3.89253761e-13 1.00000000e+00]\n",
" [1.00349985e-10 2.56293708e-10 2.91177582e-10 ... 1.00347876e-10\n",
" 1.00334095e-10 9.99998808e-01]]]\n"
]
}
],
"source": [
"def test(filename):\n",
" feature = dataset.process_utterance(filename, \"\")\n",
" audio = np.array([feature[0]]).astype('float32') #[1, D, T]\n",
" audio_len = feature[0].shape[1]\n",
" audio_len = np.array([audio_len]).astype('int64') # [1]\n",
" \n",
" print(audio.shape, audio_len.shape)\n",
"\n",
" i_probs = run(predictor, audio, audio_len)\n",
" print('jit:', i_probs[0])\n",
" return i_probs\n",
" \n",
"probs = test(sample['audio_filepath'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 32,
"id": "academic-surname",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddle import nn"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "fundamental-treasure",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parameter containing:\n",
"Tensor(shape=[256], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
" [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])\n",
"Parameter containing:\n",
"Tensor(shape=[256], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
" [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])\n"
]
}
],
"source": [
"L = nn.LayerNorm(256, epsilon=1e-12)\n",
"for p in L.parameters():\n",
" print(p)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "consolidated-elephant",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "moderate-noise",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float64\n"
]
}
],
"source": [
"x = np.random.randn(2, 51, 256)\n",
"print(x.dtype)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "cooked-progressive",
"metadata": {},
"outputs": [],
"source": [
"y = L(paddle.to_tensor(x, dtype='float32'))"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "optimum-milwaukee",
"metadata": {},
"outputs": [],
"source": [
"import torch"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "viral-indian",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parameter containing:\n",
"tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1.], requires_grad=True)\n",
"Parameter containing:\n",
"tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
" requires_grad=True)\n"
]
}
],
"source": [
"TL = torch.nn.LayerNorm(256, eps=1e-12)\n",
"for p in TL.parameters():\n",
" print(p)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "skilled-vietnamese",
"metadata": {},
"outputs": [],
"source": [
"ty = TL(torch.tensor(x, dtype=torch.float32))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "incorrect-allah",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.allclose(y.numpy(), ty.detach().numpy())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "prostate-cameroon",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 52,
"id": "governmental-surge",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = np.random.randn(2, 256)\n",
"y = L(paddle.to_tensor(x, dtype='float32'))\n",
"ty = TL(torch.tensor(x, dtype=torch.float32))\n",
"np.allclose(y.numpy(), ty.detach().numpy())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "confidential-jacket",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
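One caveat on the LayerNorm check above: np.allclose at default tolerances fails for the [2, 51, 256] input yet passes for [2, 256], which is consistent with ordinary fp32 rounding differences rather than a real mismatch. A hypothetical follow-up cell, reusing the y/ty tensors from the notebook, that reports the error magnitude instead of a bare boolean:

import numpy as np

diff = np.abs(y.numpy() - ty.detach().numpy())
print(diff.max(), diff.mean())  # inspect the worst-case discrepancy
# fp32 cross-framework checks typically need an explicit atol,
# as in the Linear notebook above (atol=1e-5).
print(np.allclose(y.numpy(), ty.detach().numpy(), atol=1e-5))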

@@ -0,0 +1,449 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "primary-organic",
"metadata": {},
"outputs": [],
"source": [
"import torch"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "stopped-semester",
"metadata": {},
"outputs": [],
"source": [
"def mask_finished_scores(score: torch.Tensor,\n",
" flag: torch.Tensor) -> torch.Tensor:\n",
" \"\"\"\n",
" If a sequence is finished, we only allow one alive branch. This function\n",
" aims to give one branch a zero score and the rest -inf score.\n",
" Args:\n",
" score (torch.Tensor): A real value array with shape\n",
" (batch_size * beam_size, beam_size).\n",
" flag (torch.Tensor): A bool array with shape\n",
" (batch_size * beam_size, 1).\n",
" Returns:\n",
" torch.Tensor: (batch_size * beam_size, beam_size).\n",
" \"\"\"\n",
" beam_size = score.size(-1)\n",
" zero_mask = torch.zeros_like(flag, dtype=torch.bool)\n",
" if beam_size > 1:\n",
" unfinished = torch.cat((zero_mask, flag.repeat([1, beam_size - 1])),\n",
" dim=1)\n",
" finished = torch.cat((flag, zero_mask.repeat([1, beam_size - 1])),\n",
" dim=1)\n",
" else:\n",
" unfinished = zero_mask\n",
" finished = flag\n",
" print(unfinished)\n",
" print(finished)\n",
" score.masked_fill_(unfinished, -float('inf'))\n",
" score.masked_fill_(finished, 0)\n",
" return score"
]
},
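  {
   "cell_type": "markdown",
   "id": "editorial-masking-example",
   "metadata": {},
   "source": [
    "A minimal worked example of the masking above, assuming `batch_size=1`, `beam_size=3` (values illustrative, not from a real run): a finished hypothesis keeps exactly one alive branch (score 0 in column 0) while every other branch gets `-inf`, so beam search can only extend it by copying itself.\n",
    "\n",
    "```python\n",
    "score = torch.tensor([[0.5, 0.2, 0.1]])\n",
    "flag = torch.tensor([[True]])  # this hypothesis is finished\n",
    "mask_finished_scores(score, flag)  # -> tensor([[0., -inf, -inf]])\n",
    "```"
   ]
  },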
{
"cell_type": "code",
"execution_count": 58,
"id": "agreed-portuguese",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[ True],\n",
" [False]])\n",
"tensor([[-0.8841, 0.7381, -0.9986],\n",
" [ 0.2675, -0.7971, 0.3798]])\n",
"tensor([[ True, True],\n",
" [False, False]])\n"
]
}
],
"source": [
"score = torch.randn((2, 3))\n",
"flag = torch.ones((2, 1), dtype=torch.bool)\n",
"flag[1] = False\n",
"print(flag)\n",
"print(score)\n",
"print(flag.repeat([1, 2]))"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "clean-aspect",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[False, True, True],\n",
" [False, False, False]])\n",
"tensor([[ True, False, False],\n",
" [False, False, False]])\n",
"tensor([[ 0.0000, -inf, -inf],\n",
" [ 0.2675, -0.7971, 0.3798]])\n",
"tensor([[ 0.0000, -inf, -inf],\n",
" [ 0.2675, -0.7971, 0.3798]])\n"
]
}
],
"source": [
"r = mask_finished_scores(score, flag)\n",
"print(r)\n",
"print(score)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "thrown-airline",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tensor(shape=[2, 1], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True ],\n",
" [False]])\n",
"Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, 1.87704289, 0.01988174],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"Tensor(shape=[2, 2], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , True ],\n",
" [False, False]])\n"
]
}
],
"source": [
"import paddle\n",
"\n",
"score = paddle.randn((2, 3))\n",
"flag = paddle.ones((2, 1), dtype='bool')\n",
"flag[1] = False\n",
"print(flag)\n",
"print(score)\n",
"print(flag.tile([1, 2]))"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "internal-patent",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tensor(shape=[2, 3], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[False, True , True ],\n",
" [False, False, False]])\n",
"Tensor(shape=[2, 3], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , False, False],\n",
" [False, False, False]])\n",
"x Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, 1.87704289, 0.01988174],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"2 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, 1.87704289, 0.01988174],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"3 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"x Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"2 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"3 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 0. , -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 0. , -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n"
]
}
],
"source": [
"paddle.bool = 'bool'\n",
"\n",
"def masked_fill(xs:paddle.Tensor, mask:paddle.Tensor, value:float):\n",
" print(xs)\n",
" trues = paddle.ones_like(xs) * value\n",
" assert xs.shape == mask.shape\n",
" xs = paddle.where(mask, trues, xs)\n",
" return xs\n",
"\n",
"def masked_fill_(xs:paddle.Tensor, mask:paddle.Tensor, value:float):\n",
" print('x', xs)\n",
" trues = paddle.ones_like(xs) * value\n",
" assert xs.shape == mask.shape\n",
" ret = paddle.where(mask, trues, xs)\n",
" print('2', xs)\n",
" paddle.assign(ret, output=xs)\n",
" print('3', xs)\n",
"\n",
"paddle.Tensor.masked_fill = masked_fill\n",
"paddle.Tensor.masked_fill_ = masked_fill_\n",
"\n",
"def mask_finished_scores_pd(score: paddle.Tensor,\n",
" flag: paddle.Tensor) -> paddle.Tensor:\n",
" \"\"\"\n",
" If a sequence is finished, we only allow one alive branch. This function\n",
" aims to give one branch a zero score and the rest -inf score.\n",
" Args:\n",
" score (torch.Tensor): A real value array with shape\n",
" (batch_size * beam_size, beam_size).\n",
" flag (torch.Tensor): A bool array with shape\n",
" (batch_size * beam_size, 1).\n",
" Returns:\n",
" torch.Tensor: (batch_size * beam_size, beam_size).\n",
" \"\"\"\n",
" beam_size = score.shape[-1]\n",
" zero_mask = paddle.zeros_like(flag, dtype=paddle.bool)\n",
" if beam_size > 1:\n",
" unfinished = paddle.concat((zero_mask, flag.tile([1, beam_size - 1])),\n",
" axis=1)\n",
" finished = paddle.concat((flag, zero_mask.tile([1, beam_size - 1])),\n",
" axis=1)\n",
" else:\n",
" unfinished = zero_mask\n",
" finished = flag\n",
" print(unfinished)\n",
" print(finished)\n",
" \n",
" #score.masked_fill_(unfinished, -float('inf'))\n",
" #score.masked_fill_(finished, 0)\n",
"# infs = paddle.ones_like(score) * -float('inf')\n",
"# score = paddle.where(unfinished, infs, score)\n",
"# score = paddle.where(finished, paddle.zeros_like(score), score)\n",
"\n",
"# score = score.masked_fill(unfinished, -float('inf'))\n",
"# score = score.masked_fill(finished, 0)\n",
" score.masked_fill_(unfinished, -float('inf'))\n",
" score.masked_fill_(finished, 0)\n",
" return score\n",
"\n",
"r = mask_finished_scores_pd(score, flag)\n",
"print(r)"
]
},
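  {
   "cell_type": "markdown",
   "id": "editorial-inplace-note",
   "metadata": {},
   "source": [
    "Note on the paddle port: `paddle.where` is out-of-place, so the in-place `masked_fill_` is emulated by computing the result and writing it back with `paddle.assign(ret, output=xs)`. The `'2'`/`'3'` print pairs in the output above show that `xs` only changes after the assign."
   ]
  },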
{
"cell_type": "code",
"execution_count": 57,
"id": "vocal-prime",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<bound method PyCapsule.value of Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 0. , -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])>"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score.value"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "bacterial-adolescent",
"metadata": {},
"outputs": [],
"source": [
"from typing import Union, Any"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "absent-fiber",
"metadata": {},
"outputs": [],
"source": [
"def repeat(xs : paddle.Tensor, *size: Any):\n",
" print(size)\n",
" return paddle.tile(xs, size)\n",
"paddle.Tensor.repeat = repeat"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "material-harbor",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 2)\n",
"Tensor(shape=[2, 2], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , True ],\n",
" [False, False]])\n"
]
}
],
"source": [
"flag = paddle.ones((2, 1), dtype='bool')\n",
"flag[1] = False\n",
"print(flag.repeat(1, 2))"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "acute-brighton",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Tensor(shape=[1], dtype=int64, place=CUDAPlace(0), stop_gradient=True,\n",
" [1]), 2)\n",
"Tensor(shape=[2, 2], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , True ],\n",
" [False, False]])\n"
]
}
],
"source": [
"flag = paddle.ones((2, 1), dtype='bool')\n",
"flag[1] = False\n",
"print(flag.repeat(paddle.to_tensor(1), 2))"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "european-rugby",
"metadata": {},
"outputs": [],
"source": [
"def size(xs, *args: int):\n",
" nargs = len(args)\n",
" s = paddle.shape(xs)\n",
" assert(nargs <= 1)\n",
" if nargs == 1:\n",
" return s[args[0]]\n",
" else:\n",
" return s\n",
"paddle.Tensor.size = size"
]
},
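  {
   "cell_type": "markdown",
   "id": "editorial-size-note",
   "metadata": {},
   "source": [
    "Note that this `size` shim returns a paddle `Tensor` (from `paddle.shape`), not a Python int/tuple as in torch, which is what the cells below show; callers that need plain ints must convert explicitly."
   ]
  },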
{
"cell_type": "code",
"execution_count": 86,
"id": "moral-special",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[2], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [2, 1])"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flag.size()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "ahead-coach",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [1])"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flag.size(1)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "incomplete-fitness",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [2])"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flag.size(0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "upset-connectivity",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,231 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"id": "designing-borough",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspace/DeepSpeech-2.x/tools/venv/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00\n",
" 0.0000000e+00 0.0000000e+00]\n",
" [ 8.4147096e-01 8.0196178e-01 7.6172036e-01 ... 1.2409373e-04\n",
" 1.1547816e-04 1.0746076e-04]\n",
" [ 9.0929741e-01 9.5814437e-01 9.8704624e-01 ... 2.4818745e-04\n",
" 2.3095631e-04 2.1492151e-04]\n",
" ...\n",
" [ 3.7960774e-01 7.4510968e-01 7.3418564e-01 ... 1.2036801e-02\n",
" 1.1201146e-02 1.0423505e-02]\n",
" [-5.7338190e-01 -8.9752287e-02 -4.1488394e-02 ... 1.2160885e-02\n",
" 1.1316618e-02 1.0530960e-02]\n",
" [-9.9920684e-01 -8.5234123e-01 -7.8794664e-01 ... 1.2284970e-02\n",
" 1.1432089e-02 1.0638415e-02]]\n",
"True\n",
"True\n"
]
}
],
"source": [
"import torch\n",
"import math\n",
"import numpy as np\n",
"\n",
"max_len=100\n",
"d_model=256\n",
"\n",
"pe = torch.zeros(max_len, d_model)\n",
"position = torch.arange(0, max_len,\n",
" dtype=torch.float32).unsqueeze(1)\n",
"toruch_position = position\n",
"div_term = torch.exp(\n",
" torch.arange(0, d_model, 2, dtype=torch.float32) *\n",
" -(math.log(10000.0) / d_model))\n",
"tourch_div_term = div_term.cpu().detach().numpy()\n",
"\n",
"\n",
"\n",
"torhc_sin = torch.sin(position * div_term)\n",
"torhc_cos = torch.cos(position * div_term)\n",
"print(torhc_sin.cpu().detach().numpy())\n",
"np_sin = np.sin((position * div_term).cpu().detach().numpy())\n",
"np_cos = np.cos((position * div_term).cpu().detach().numpy())\n",
"print(np.allclose(np_sin, torhc_sin.cpu().detach().numpy()))\n",
"print(np.allclose(np_cos, torhc_cos.cpu().detach().numpy()))\n",
"pe[:, 0::2] = torhc_sin\n",
"pe[:, 1::2] = torhc_cos\n",
"tourch_pe = pe.cpu().detach().numpy()"
]
},
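  {
   "cell_type": "markdown",
   "id": "editorial-pe-formula",
   "metadata": {},
   "source": [
    "For reference, the cell above computes the standard sinusoidal positional encoding from [Attention Is All You Need](https://arxiv.org/abs/1706.03762):\n",
    "$PE_{(pos,2i)} = \\sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \\cos(pos/10000^{2i/d_{model}})$,\n",
    "where `div_term` holds the $10000^{-2i/d_{model}}$ factors, computed in log space for numerical stability."
   ]
  },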
{
"cell_type": "code",
"execution_count": 5,
"id": "swiss-referral",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"False\n",
"False\n",
"False\n",
"False\n",
"[[ 1. 1. 1. ... 1. 1.\n",
" 1. ]\n",
" [ 0.5403023 0.59737533 0.6479059 ... 1. 1.\n",
" 1. ]\n",
" [-0.41614684 -0.28628543 -0.1604359 ... 0.99999994 1.\n",
" 1. ]\n",
" ...\n",
" [-0.92514753 -0.66694194 -0.67894876 ... 0.9999276 0.99993724\n",
" 0.9999457 ]\n",
" [-0.81928825 -0.9959641 -0.999139 ... 0.99992603 0.999936\n",
" 0.99994457]\n",
" [ 0.03982088 -0.52298605 -0.6157435 ... 0.99992454 0.9999347\n",
" 0.99994344]]\n",
"----\n",
"[[ 1. 1. 1. ... 1. 1.\n",
" 1. ]\n",
" [ 0.54030234 0.59737533 0.6479059 ... 1. 1.\n",
" 1. ]\n",
" [-0.41614684 -0.28628543 -0.1604359 ... 1. 1.\n",
" 1. ]\n",
" ...\n",
" [-0.92514753 -0.66694194 -0.67894876 ... 0.9999276 0.9999373\n",
" 0.9999457 ]\n",
" [-0.81928825 -0.9959641 -0.999139 ... 0.99992603 0.999936\n",
" 0.99994457]\n",
" [ 0.03982088 -0.5229861 -0.6157435 ... 0.99992454 0.9999347\n",
" 0.99994344]]\n",
")))))))\n",
"[[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00\n",
" 0.0000000e+00 0.0000000e+00]\n",
" [ 8.4147096e-01 8.0196178e-01 7.6172036e-01 ... 1.2409373e-04\n",
" 1.1547816e-04 1.0746076e-04]\n",
" [ 9.0929741e-01 9.5814437e-01 9.8704624e-01 ... 2.4818745e-04\n",
" 2.3095631e-04 2.1492151e-04]\n",
" ...\n",
" [ 3.7960774e-01 7.4510968e-01 7.3418564e-01 ... 1.2036801e-02\n",
" 1.1201146e-02 1.0423505e-02]\n",
" [-5.7338190e-01 -8.9752287e-02 -4.1488394e-02 ... 1.2160885e-02\n",
" 1.1316618e-02 1.0530960e-02]\n",
" [-9.9920684e-01 -8.5234123e-01 -7.8794664e-01 ... 1.2284970e-02\n",
" 1.1432089e-02 1.0638415e-02]]\n",
"----\n",
"[[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00\n",
" 0.0000000e+00 0.0000000e+00]\n",
" [ 8.4147096e-01 8.0196178e-01 7.6172036e-01 ... 1.2409373e-04\n",
" 1.1547816e-04 1.0746076e-04]\n",
" [ 9.0929741e-01 9.5814437e-01 9.8704624e-01 ... 2.4818745e-04\n",
" 2.3095631e-04 2.1492151e-04]\n",
" ...\n",
" [ 3.7960774e-01 7.4510968e-01 7.3418564e-01 ... 1.2036801e-02\n",
" 1.1201146e-02 1.0423505e-02]\n",
" [-5.7338190e-01 -8.9752287e-02 -4.1488394e-02 ... 1.2160885e-02\n",
" 1.1316618e-02 1.0530960e-02]\n",
" [-9.9920684e-01 -8.5234123e-01 -7.8794664e-01 ... 1.2284970e-02\n",
" 1.1432089e-02 1.0638415e-02]]\n"
]
}
],
"source": [
"import paddle\n",
"paddle.set_device('cpu')\n",
"ppe = paddle.zeros((max_len, d_model), dtype='float32')\n",
"position = paddle.arange(0, max_len,\n",
" dtype='float32').unsqueeze(1)\n",
"print(np.allclose(position.numpy(), toruch_position))\n",
"div_term = paddle.exp(\n",
" paddle.arange(0, d_model, 2, dtype='float32') *\n",
" -(math.log(10000.0) / d_model))\n",
"print(np.allclose(div_term.numpy(), tourch_div_term))\n",
"\n",
"\n",
"\n",
"p_sin = paddle.sin(position * div_term)\n",
"p_cos = paddle.cos(position * div_term)\n",
"print(np.allclose(np_sin, p_sin.numpy(), rtol=1.e-6, atol=0))\n",
"print(np.allclose(np_cos, p_cos.numpy(), rtol=1.e-6, atol=0))\n",
"ppe[:, 0::2] = p_sin\n",
"ppe[:, 1::2] = p_cos\n",
"print(np.allclose(p_sin.numpy(), torhc_sin.cpu().detach().numpy()))\n",
"print(np.allclose(p_cos.numpy(), torhc_cos.cpu().detach().numpy()))\n",
"print(p_cos.numpy())\n",
"print(\"----\")\n",
"print(torhc_cos.cpu().detach().numpy())\n",
"print(\")))))))\")\n",
"print(p_sin.numpy())\n",
"print(\"----\")\n",
"print(torhc_sin.cpu().detach().numpy())"
]
},
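  {
   "cell_type": "markdown",
   "id": "editorial-fp32-note",
   "metadata": {},
   "source": [
    "The remaining `False` checks are consistent with small float32 differences between the two frameworks' `sin`/`cos` kernels: the printed matrices differ only in the last decimal place (e.g. `0.5403023` vs `0.54030234`), so element-wise comparison under tight tolerances fails even though the encodings agree for practical purposes."
   ]
  },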
{
"cell_type": "code",
"execution_count": 4,
"id": "integrated-boards",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n"
]
}
],
"source": [
"print(np.allclose(ppe.numpy(), pe.numpy()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "flying-reserve",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "revised-divide",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because one or more lines are too long

@ -249,7 +249,7 @@
}
],
"source": [
" for idx, (audio, text, audio_len, text_len) in enumerate(batch_reader()):\n",
" for idx, (audio, audio_len, text, text_len) in enumerate(batch_reader()):\n",
" print('test', text)\n",
" print(\"test raw\", ''.join(batch_reader.dataset.vocab_list[i] for i in text[0]))\n",
" print(\"test raw\", ''.join(batch_reader.dataset.vocab_list[i] for i in text[-1]))\n",
@ -454,7 +454,7 @@
" act='brelu')\n",
"\n",
" out_channel = 32\n",
" self.conv_stack = nn.LayerList([\n",
" self.conv_stack = nn.Sequential([\n",
" ConvBn(\n",
" num_channels_in=32,\n",
" num_channels_out=out_channel,\n",
@ -835,7 +835,7 @@
"\n",
" return logits, probs, audio_len\n",
"\n",
" def forward(self, audio, text, audio_len, text_len):\n",
" def forward(self, audio, audio_len, text, text_len):\n",
" \"\"\"\n",
" audio: shape [B, D, T]\n",
" text: shape [B, T]\n",
@ -877,10 +877,10 @@
"metadata": {},
"outputs": [],
"source": [
"audio, text, audio_len, text_len = None, None, None, None\n",
"audio, audio_len, text, text_len = None, None, None, None\n",
"\n",
"for idx, inputs in enumerate(batch_reader):\n",
" audio, text, audio_len, text_len = inputs\n",
" audio, audio_len, text, text_len = inputs\n",
"# print(idx)\n",
"# print('a', audio.shape, audio.place)\n",
"# print('t', text)\n",
@ -960,7 +960,7 @@
}
],
"source": [
"outputs = dp_model(audio, text, audio_len, text_len)\n",
"outputs = dp_model(audio, audio_len, text, text_len)\n",
"logits, _, logits_len = outputs\n",
"print('logits len', logits_len)\n",
"loss = loss_fn.forward(logits, text, logits_len, text_len)\n",
@ -1884,4 +1884,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}

File diff suppressed because it is too large

@ -3,6 +3,7 @@
hooks:
- id: yapf
files: \.py$
exclude: (?=third_party).*(\.py)$
- repo: https://github.com/pre-commit/pre-commit-hooks
sha: a11d9314b22d8f8c7556443875b731ef05965464
hooks:
@ -14,7 +15,22 @@
files: \.md$
- id: trailing-whitespace
files: \.md$
- repo: https://github.com/Lucas-C/pre-commit-hooks
- id: requirements-txt-fixer
exclude: (?=third_party).*$
- id: check-yaml
- id: check-json
- id: pretty-format-json
args:
- --no-sort-keys
- --autofix
- id: check-merge-conflict
- id: flake8
args:
- --ignore=E501,E228,E226,E261,E266,E128,E402,W503
- --builtins=G,request
- --jobs=1
exclude: (?=third_party).*(\.py)$
- repo: https://github.com/Lucas-C/pre-commit-hooks
sha: v1.0.1
hooks:
- id: forbid-crlf
@ -38,4 +54,9 @@
entry: python .pre-commit-hooks/copyright-check.hook
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
#exclude: (?=decoders/swig).*(\.cpp|\.h)$
exclude: (?=third_party|pypinyin).*(\.cpp|\.h|\.py)$
- repo: https://github.com/asottile/reorder_python_imports
rev: v2.4.0
hooks:
- id: reorder-python-imports
exclude: (?=third_party).*(\.py)$

@ -19,14 +19,14 @@ addons:
before_install:
- python3 --version
- python3 -m pip --version
- sudo pip install -U virtualenv pre-commit pip
- pip3 --version
- sudo pip3 install -U virtualenv pre-commit pip
- docker pull paddlepaddle/paddle:latest
script:
- exit_code=0
- .travis/precommit.sh || exit_code=$(( exit_code | $? ))
- docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
'cd /py_unittest; source env.sh; bash .travis/unittest.sh' || exit_code=$(( exit_code | $? ))
'cd /py_unittest && bash .travis/precommit.sh && source env.sh && bash .travis/unittest.sh' || exit_code=$(( exit_code | $? ))
exit $exit_code
notifications:

@ -0,0 +1,37 @@
#!/bin/bash
setup_env(){
cd tools && make && cd -
}
install(){
if [ -f "setup.sh" ]; then
bash setup.sh
#export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
fi
if [ $? != 0 ]; then
exit 1
fi
}
print_env(){
cat /etc/lsb-release
gcc -v
g++ -v
}
abort(){
echo "Run install failed" 1>&2
echo "Please check your code" 1>&2
exit 1
}
trap 'abort' 0
set -e
print_env
setup_env
source tools/venv/bin/activate
install
trap : 0

@ -1,16 +1,18 @@
#!/bin/bash
function abort(){
echo "Your commit not fit PaddlePaddle code style" 1>&2
echo "Please use pre-commit scripts to auto-format your code" 1>&2
exit 1
}
trap 'abort' 0
set -e
cd `dirname $0`
cd ..
export PATH=/usr/bin:$PATH
pre-commit install
source tools/venv/bin/activate
python3 --version
if ! pre-commit run -a ; then
ls -lh

@ -1,11 +1,14 @@
#!/bin/bash
abort(){
echo "Run unittest failed" 1>&2
echo "Please check your code" 1>&2
exit 1
}
unittest(){
cd $1 > /dev/null
if [ -f "setup.sh" ]; then
@ -21,13 +24,31 @@ unittest(){
cd - > /dev/null
}
coverage(){
cd $1 > /dev/null
if [ -f "setup.sh" ]; then
bash setup.sh
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
fi
if [ $? != 0 ]; then
exit 1
fi
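# run each tests/ directory under coverage, pruning the virtualenv tree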
find . -path ./tools/venv -prune -false -o -name 'tests' -type d -print0 | \
xargs -0 -I{} -n1 bash -c \
'python3 -m coverage run --branch {}'
python3 -m coverage report -m
python3 -m coverage html
cd - > /dev/null
}
trap 'abort' 0
set -e
cd tools; make; cd -
. tools/venv/bin/activate
pip3 install pytest
unittest .
source tools/venv/bin/activate
#pip3 install pytest
#unittest .
coverage .
trap : 0

.vimrc

@ -0,0 +1,468 @@
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Maintainer:
" Amir Salihefendic — @amix3k
"
" Awesome_version:
" Get this config, nice color schemes and lots of plugins!
"
" Install the awesome version from:
"
" https://github.com/amix/vimrc
"
" Sections:
" -> General
" -> VIM user interface
" -> Colors and Fonts
" -> Files and backups
" -> Text, tab and indent related
" -> Visual mode related
" -> Moving around, tabs and buffers
" -> Status line
" -> Editing mappings
" -> vimgrep searching and cope displaying
" -> Spell checking
" -> Misc
" -> Helper functions
"
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => General
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Sets how many lines of history VIM has to remember
set history=500
" Enable filetype plugins
filetype plugin on
filetype indent on
" Set to auto read when a file is changed from the outside
set autoread
au FocusGained,BufEnter * checktime
" With a map leader it's possible to do extra key combinations
" like <leader>w saves the current file
let mapleader = ","
" Fast saving
nmap <leader>w :w!<cr>
" :W sudo saves the file
" (useful for handling the permission-denied error)
command! W execute 'w !sudo tee % > /dev/null' <bar> edit!
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => VIM user interface
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Set 7 lines to the cursor - when moving vertically using j/k
set so=7
" Avoid garbled characters in Chinese language windows OS
let $LANG='en'
set langmenu=en
source $VIMRUNTIME/delmenu.vim
source $VIMRUNTIME/menu.vim
" Turn on the Wild menu
set wildmenu
" Ignore compiled files
set wildignore=*.o,*~,*.pyc
if has("win16") || has("win32")
set wildignore+=.git\*,.hg\*,.svn\*
else
set wildignore+=*/.git/*,*/.hg/*,*/.svn/*,*/.DS_Store
endif
"Always show current position
set ruler
" Height of the command bar
set cmdheight=1
" A buffer becomes hidden when it is abandoned
set hid
" Configure backspace so it acts as it should act
set backspace=eol,start,indent
set whichwrap+=<,>,h,l
" Ignore case when searching
set ignorecase
" When searching try to be smart about cases
set smartcase
" Highlight search results
set hlsearch
" Makes search act like search in modern browsers
set incsearch
" Don't redraw while executing macros (good performance config)
set lazyredraw
" For regular expressions turn magic on
set magic
" Show matching brackets when text indicator is over them
set showmatch
" How many tenths of a second to blink when matching brackets
set mat=2
" No annoying sound on errors
set noerrorbells
set novisualbell
set t_vb=
set tm=500
" Properly disable sound on errors on MacVim
if has("gui_macvim")
autocmd GUIEnter * set vb t_vb=
endif
" Add a bit extra margin to the left
set foldcolumn=1
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Colors and Fonts
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Enable syntax highlighting
syntax enable
" Enable 256 colors palette in Gnome Terminal
if $COLORTERM == 'gnome-terminal'
set t_Co=256
endif
try
colorscheme desert
catch
endtry
set background=dark
" Set extra options when running in GUI mode
if has("gui_running")
set guioptions-=T
set guioptions-=e
set t_Co=256
set guitablabel=%M\ %t
endif
" Set utf8 as standard encoding and en_US as the standard language
set encoding=utf8
set fileencodings=ucs-bom,utf-8,cp936
set fileencoding=gb2312
set termencoding=utf-8
" Use Unix as the standard file type
set ffs=unix,dos,mac
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Files, backups and undo
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Turn backup off, since most stuff is in SVN, git etc. anyway...
set nobackup
set nowb
set noswapfile
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Text, tab and indent related
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Use spaces instead of tabs
set expandtab
" Be smart when using tabs ;)
set smarttab
" 1 tab == 4 spaces
set shiftwidth=4
set tabstop=4
" Linebreak on 500 characters
set lbr
set tw=500
set ai "Auto indent
set si "Smart indent
set wrap "Wrap lines
""""""""""""""""""""""""""""""
" => Visual mode related
""""""""""""""""""""""""""""""
" Visual mode pressing * or # searches for the current selection
" Super useful! From an idea by Michael Naumann
vnoremap <silent> * :<C-u>call VisualSelection('', '')<CR>/<C-R>=@/<CR><CR>
vnoremap <silent> # :<C-u>call VisualSelection('', '')<CR>?<C-R>=@/<CR><CR>
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Moving around, tabs, windows and buffers
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Map <Space> to / (search) and Ctrl-<Space> to ? (backwards search)
map <space> /
map <C-space> ?
" Disable highlight when <leader><cr> is pressed
map <silent> <leader><cr> :noh<cr>
" Smart way to move between windows
map <C-j> <C-W>j
map <C-k> <C-W>k
map <C-h> <C-W>h
map <C-l> <C-W>l
" Close the current buffer
map <leader>bd :Bclose<cr>:tabclose<cr>gT
" Close all the buffers
map <leader>ba :bufdo bd<cr>
map <leader>l :bnext<cr>
map <leader>h :bprevious<cr>
" Useful mappings for managing tabs
map <leader>tn :tabnew<cr>
map <leader>to :tabonly<cr>
map <leader>tc :tabclose<cr>
map <leader>tm :tabmove
map <leader>t<leader> :tabnext
" Let 'tl' toggle between this and the last accessed tab
let g:lasttab = 1
nmap <Leader>tl :exe "tabn ".g:lasttab<CR>
au TabLeave * let g:lasttab = tabpagenr()
" Opens a new tab with the current buffer's path
" Super useful when editing files in the same directory
map <leader>te :tabedit <C-r>=expand("%:p:h")<cr>/
" Switch CWD to the directory of the open buffer
map <leader>cd :cd %:p:h<cr>:pwd<cr>
" Specify the behavior when switching between buffers
try
set switchbuf=useopen,usetab,newtab
set stal=2
catch
endtry
" Return to last edit position when opening files (You want this!)
au BufReadPost * if line("'\"") > 1 && line("'\"") <= line("$") | exe "normal! g'\"" | endif
""""""""""""""""""""""""""""""
" => Status line
""""""""""""""""""""""""""""""
" Always show the status line
set laststatus=2
" Format the status line
set statusline=\ %{HasPaste()}%F%m%r%h\ %w\ \ CWD:\ %r%{getcwd()}%h\ \ \ Line:\ %l\ \ Column:\ %c
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Editing mappings
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Remap VIM 0 to first non-blank character
map 0 ^
" Move a line of text using ALT+[jk] or Command+[jk] on mac
nmap <M-j> mz:m+<cr>`z
nmap <M-k> mz:m-2<cr>`z
vmap <M-j> :m'>+<cr>`<my`>mzgv`yo`z
vmap <M-k> :m'<-2<cr>`>my`<mzgv`yo`z
if has("mac") || has("macunix")
nmap <D-j> <M-j>
nmap <D-k> <M-k>
vmap <D-j> <M-j>
vmap <D-k> <M-k>
endif
" Delete trailing white space on save, useful for some filetypes ;)
fun! CleanExtraSpaces()
let save_cursor = getpos(".")
let old_query = getreg('/')
silent! %s/\s\+$//e
call setpos('.', save_cursor)
call setreg('/', old_query)
endfun
if has("autocmd")
autocmd BufWritePre *.txt,*.js,*.py,*.wiki,*.sh,*.coffee :call CleanExtraSpaces()
endif
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Spell checking
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Pressing ,ss will toggle and untoggle spell checking
map <leader>ss :setlocal spell!<cr>
" Shortcuts using <leader>
map <leader>sn ]s
map <leader>sp [s
map <leader>sa zg
map <leader>s? z=
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Misc
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Remove the Windows ^M - when the encodings gets messed up
noremap <Leader>m mmHmt:%s/<C-V><cr>//ge<cr>'tzt'm
" Quickly open a buffer for scribble
map <leader>q :e ~/buffer<cr>
" Quickly open a markdown buffer for scribble
map <leader>x :e ~/buffer.md<cr>
" Toggle paste mode on and off
map <leader>pp :setlocal paste!<cr>
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Helper functions
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Returns true if paste mode is enabled
function! HasPaste()
if &paste
return 'PASTE MODE '
endif
return ''
endfunction
" Don't close window, when deleting a buffer
command! Bclose call <SID>BufcloseCloseIt()
function! <SID>BufcloseCloseIt()
let l:currentBufNum = bufnr("%")
let l:alternateBufNum = bufnr("#")
if buflisted(l:alternateBufNum)
buffer #
else
bnext
endif
if bufnr("%") == l:currentBufNum
new
endif
if buflisted(l:currentBufNum)
execute("bdelete! ".l:currentBufNum)
endif
endfunction
function! CmdLine(str)
call feedkeys(":" . a:str)
endfunction
function! VisualSelection(direction, extra_filter) range
let l:saved_reg = @"
execute "normal! vgvy"
let l:pattern = escape(@", "\\/.*'$^~[]")
let l:pattern = substitute(l:pattern, "\n$", "", "")
if a:direction == 'gv'
call CmdLine("Ack '" . l:pattern . "' " )
elseif a:direction == 'replace'
call CmdLine("%s" . '/'. l:pattern . '/')
endif
let @/ = l:pattern
let @" = l:saved_reg
endfunction
""""""""""""""""""""""""""""""
" => Python section
""""""""""""""""""""""""""""""
let python_highlight_all = 1
au FileType python syn keyword pythonDecorator True None False self
au BufNewFile,BufRead *.jinja set syntax=htmljinja
au BufNewFile,BufRead *.mako set ft=mako
au FileType python map <buffer> F :set foldmethod=indent<cr>
au FileType python inoremap <buffer> $r return
au FileType python inoremap <buffer> $i import
au FileType python inoremap <buffer> $p print
au FileType python inoremap <buffer> $f # --- <esc>a
au FileType python map <buffer> <leader>1 /class
au FileType python map <buffer> <leader>2 /def
au FileType python map <buffer> <leader>C ?class
au FileType python map <buffer> <leader>D ?def
""""""""""""""""""""""""""""""
" => JavaScript section
"""""""""""""""""""""""""""""""
au FileType javascript call JavaScriptFold()
au FileType javascript setl fen
au FileType javascript setl nocindent
au FileType javascript imap <C-t> $log();<esc>hi
au FileType javascript imap <C-a> alert();<esc>hi
au FileType javascript inoremap <buffer> $r return
au FileType javascript inoremap <buffer> $f // --- PH<esc>FP2xi
function! JavaScriptFold()
setl foldmethod=syntax
setl foldlevelstart=1
syn region foldBraces start=/{/ end=/}/ transparent fold keepend extend
function! FoldText()
return substitute(getline(v:foldstart), '{.*', '{...}', '')
endfunction
setl foldtext=FoldText()
endfunction
""""""""""""""""""""""""""""""
" => CoffeeScript section
"""""""""""""""""""""""""""""""
function! CoffeeScriptFold()
setl foldmethod=indent
setl foldlevelstart=1
endfunction
au FileType coffee call CoffeeScriptFold()
au FileType gitcommit call setpos('.', [0, 1, 1, 0])
""""""""""""""""""""""""""""""
" => Shell section
""""""""""""""""""""""""""""""
if exists('$TMUX')
if has('nvim')
set termguicolors
else
set term=screen-256color
endif
endif
""""""""""""""""""""""""""""""
" => Twig section
""""""""""""""""""""""""""""""
autocmd BufRead *.twig set syntax=html filetype=html
""""""""""""""""""""""""""""""
" => Markdown
""""""""""""""""""""""""""""""
let vim_markdown_folding_disabled = 1

@ -11,7 +11,10 @@
## Models
* [Baidu's Deep Speech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Baidu's DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Transformer](https://arxiv.org/abs/1706.03762)
* [Conformer](https://arxiv.org/abs/2005.08100)
* [U2](https://arxiv.org/pdf/2012.05481.pdf)
## Setup
@ -22,19 +25,20 @@ Please see [install](docs/install.md).
## Getting Started
Please see [Getting Started](docs/getting_started.md) and [tiny egs](examples/tiny/README.md).
Please see [Getting Started](docs/src/geting_started.md) and [tiny egs](examples/tiny/README.md).
## More Information
* [Install](docs/install.md)
* [Getting Started](docs/getting_started.md)
* [Data Preparation](docs/data_preparation.md)
* [Data Augmentation](docs/augmentation.md)
* [Ngram LM](docs/ngram_lm.md)
* [Server Demo](docs/server.md)
* [Benchmark](docs/benchmark.md)
* [Released Model](docs/released_model.md)
* [FAQ](docs/faq.md)
* [Install](docs/src/install.md)
* [Getting Started](docs/src/geting_stared.md)
* [Data Preparation](docs/src/data_preparation.md)
* [Data Augmentation](docs/src/augmentation.md)
* [Ngram LM](docs/src/ngram_lm.md)
* [Server Demo](docs/src/server.md)
* [Benchmark](docs/src/benchmark.md)
* [Released Model](docs/src/released_model.md)
* [FAQ](docs/src/faq.md)
## Questions and Help
@ -45,3 +49,7 @@ You are welcome to submit questions in [Github Discussions](https://github.com/P
## License
DeepSpeech is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
We depend on many open-source repos. See [References](docs/src/reference.md) for more information.

@ -11,7 +11,11 @@
## Models
* [Baidu's Deep Speech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Baidu's DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Transformer](https://arxiv.org/abs/1706.03762)
* [Conformer](https://arxiv.org/abs/2005.08100)
* [U2](https://arxiv.org/pdf/2012.05481.pdf)
## Setup
@ -22,19 +26,19 @@
## Getting Started
Please see [Getting Started](docs/getting_started.md) and [tiny egs](examples/tiny/README.md).
Please see [Getting Started](docs/src/geting_started.md) and [tiny egs](examples/tiny/README.md).
## More Information
* [Install](docs/install.md)
* [Getting Started](docs/getting_started.md)
* [Data Preparation](docs/data_preparation.md)
* [Data Augmentation](docs/augmentation.md)
* [Ngram LM](docs/ngram_lm.md)
* [Server Demo](docs/server.md)
* [Benchmark](docs/benchmark.md)
* [Released Model](docs/released_model.md)
* [FAQ](docs/faq.md)
* [Install](docs/src/install.md)
* [Getting Started](docs/src/geting_stared.md)
* [Data Preparation](docs/src/data_preparation.md)
* [Data Augmentation](docs/src/augmentation.md)
* [Ngram LM](docs/src/ngram_lm.md)
* [Server Demo](docs/src/server.md)
* [Benchmark](docs/src/benchmark.md)
* [Released Model](docs/src/released_model.md)
* [FAQ](docs/src/faq.md)
## Questions and Help
@ -43,3 +47,7 @@
## License
DeepSpeech is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
We drew on many excellent open-source repos during development. See [References](docs/src/reference.md) for details.

@ -11,3 +11,478 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any
from typing import List
from typing import Tuple
from typing import Union
import paddle
from paddle import nn
from paddle.fluid import core
from paddle.nn import functional as F
from deepspeech.utils.log import Log
#TODO(Hui Zhang): remove fluid import
logger = Log(__name__).getlog()
########### hack logging #############
logger.warn = logger.warning
########### hack paddle #############
paddle.bool = 'bool'
paddle.float16 = 'float16'
paddle.half = 'float16'
paddle.float32 = 'float32'
paddle.float = 'float32'
paddle.float64 = 'float64'
paddle.double = 'float64'
paddle.int8 = 'int8'
paddle.int16 = 'int16'
paddle.short = 'int16'
paddle.int32 = 'int32'
paddle.int = 'int32'
paddle.int64 = 'int64'
paddle.long = 'int64'
paddle.uint8 = 'uint8'
paddle.uint16 = 'uint16'
paddle.complex64 = 'complex64'
paddle.complex128 = 'complex128'
paddle.cdouble = 'complex128'
def convert_dtype_to_string(tensor_dtype):
"""
Convert a Paddle tensor dtype to its dtype string.
Args:
tensor_dtype(core.VarDesc.VarType): the tensor's data type.
Returns:
str: the corresponding dtype string, e.g. 'float32'.
"""
dtype = tensor_dtype
if dtype == core.VarDesc.VarType.FP32:
return paddle.float32
elif dtype == core.VarDesc.VarType.FP64:
return paddle.float64
elif dtype == core.VarDesc.VarType.FP16:
return paddle.float16
elif dtype == core.VarDesc.VarType.INT32:
return paddle.int32
elif dtype == core.VarDesc.VarType.INT16:
return paddle.int16
elif dtype == core.VarDesc.VarType.INT64:
return paddle.int64
elif dtype == core.VarDesc.VarType.BOOL:
return paddle.bool
elif dtype == core.VarDesc.VarType.BF16:
# since there is still no support for bfloat16 in NumPy,
# uint16 is used for casting bfloat16
return paddle.uint16
elif dtype == core.VarDesc.VarType.UINT8:
return paddle.uint8
elif dtype == core.VarDesc.VarType.INT8:
return paddle.int8
elif dtype == core.VarDesc.VarType.COMPLEX64:
return paddle.complex64
elif dtype == core.VarDesc.VarType.COMPLEX128:
return paddle.complex128
else:
raise ValueError("Not supported tensor dtype %s" % dtype)
if not hasattr(paddle, 'softmax'):
logger.warn("register user softmax to paddle, remove this when fixed!")
setattr(paddle, 'softmax', paddle.nn.functional.softmax)
if not hasattr(paddle, 'log_softmax'):
logger.warn("register user log_softmax to paddle, remove this when fixed!")
setattr(paddle, 'log_softmax', paddle.nn.functional.log_softmax)
if not hasattr(paddle, 'sigmoid'):
logger.warn("register user sigmoid to paddle, remove this when fixed!")
setattr(paddle, 'sigmoid', paddle.nn.functional.sigmoid)
if not hasattr(paddle, 'log_sigmoid'):
logger.warn("register user log_sigmoid to paddle, remove this when fixed!")
setattr(paddle, 'log_sigmoid', paddle.nn.functional.log_sigmoid)
if not hasattr(paddle, 'relu'):
logger.warn("register user relu to paddle, remove this when fixed!")
setattr(paddle, 'relu', paddle.nn.functional.relu)
def cat(xs, dim=0):
return paddle.concat(xs, axis=dim)
if not hasattr(paddle, 'cat'):
logger.warn(
"override cat of paddle if exists or register, remove this when fixed!")
paddle.cat = cat
########### hack paddle.Tensor #############
def item(x: paddle.Tensor):
return x.numpy().item()
if not hasattr(paddle.Tensor, 'item'):
logger.warn(
"override item of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.item = item
def func_long(x: paddle.Tensor):
return paddle.cast(x, paddle.long)
if not hasattr(paddle.Tensor, 'long'):
logger.warn(
"override long of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.long = func_long
if not hasattr(paddle.Tensor, 'numel'):
logger.warn(
"override numel of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.numel = paddle.numel
def new_full(x: paddle.Tensor,
size: Union[List[int], Tuple[int], paddle.Tensor],
fill_value: Union[float, int, bool, paddle.Tensor],
dtype=None):
return paddle.full(size, fill_value, dtype=dtype if dtype is not None else x.dtype)
if not hasattr(paddle.Tensor, 'new_full'):
logger.warn(
"override new_full of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.new_full = new_full
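# eq shim: casts bool tensors to int and lifts a scalar ys to a tensor before
# paddle.equal, mirroring torch's x.eq(value) semantics.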
def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
if convert_dtype_to_string(xs.dtype) == paddle.bool:
xs = xs.astype(paddle.int)
return xs.equal(
paddle.to_tensor(
ys, dtype=convert_dtype_to_string(xs.dtype), place=xs.place))
if not hasattr(paddle.Tensor, 'eq'):
logger.warn(
"override eq of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.eq = eq
if not hasattr(paddle, 'eq'):
logger.warn(
"override eq of paddle if exists or register, remove this when fixed!")
paddle.eq = eq
def contiguous(xs: paddle.Tensor) -> paddle.Tensor:
return xs
if not hasattr(paddle.Tensor, 'contiguous'):
logger.warn(
"override contiguous of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.contiguous = contiguous
def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
nargs = len(args)
assert (nargs <= 1)
s = paddle.shape(xs)
if nargs == 1:
return s[args[0]]
else:
return s
#`to_static` does not process the `size` property; some `paddle` APIs may depend on it.
logger.warn(
"override size of paddle.Tensor "
"(`to_static` does not process the `size` property; some `paddle` APIs may depend on it), remove this when fixed!"
)
paddle.Tensor.size = size
def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
return xs.reshape(args)
if not hasattr(paddle.Tensor, 'view'):
logger.warn("register user view to paddle.Tensor, remove this when fixed!")
paddle.Tensor.view = view
def view_as(xs: paddle.Tensor, ys: paddle.Tensor) -> paddle.Tensor:
return xs.reshape(ys.size())
if not hasattr(paddle.Tensor, 'view_as'):
logger.warn(
"register user view_as to paddle.Tensor, remove this when fixed!")
paddle.Tensor.view_as = view_as
def is_broadcastable(shp1, shp2):
for a, b in zip(shp1[::-1], shp2[::-1]):
if a == 1 or b == 1 or a == b:
pass
else:
return False
return True
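# masked_fill/masked_fill_ emulate the torch API: broadcast the bool mask to
# the tensor's shape and select with paddle.where; the in-place variant then
# writes the result back through paddle.assign, since paddle.where is
# out-of-place.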
def masked_fill(xs: paddle.Tensor,
mask: paddle.Tensor,
value: Union[float, int]):
assert is_broadcastable(xs.shape, mask.shape) is True
bshape = paddle.broadcast_shape(xs.shape, mask.shape)
mask = mask.broadcast_to(bshape)
trues = paddle.ones_like(xs) * value
xs = paddle.where(mask, trues, xs)
return xs
if not hasattr(paddle.Tensor, 'masked_fill'):
logger.warn(
"register user masked_fill to paddle.Tensor, remove this when fixed!")
paddle.Tensor.masked_fill = masked_fill
def masked_fill_(xs: paddle.Tensor,
mask: paddle.Tensor,
value: Union[float, int]) -> paddle.Tensor:
assert is_broadcastable(xs.shape, mask.shape) is True
bshape = paddle.broadcast_shape(xs.shape, mask.shape)
mask = mask.broadcast_to(bshape)
trues = paddle.ones_like(xs) * value
ret = paddle.where(mask, trues, xs)
paddle.assign(ret.detach(), output=xs)
return xs
if not hasattr(paddle.Tensor, 'masked_fill_'):
logger.warn(
"register user masked_fill_ to paddle.Tensor, remove this when fixed!")
paddle.Tensor.masked_fill_ = masked_fill_
def fill_(xs: paddle.Tensor, value: Union[float, int]) -> paddle.Tensor:
val = paddle.full_like(xs, value)
paddle.assign(val.detach(), output=xs)
return xs
if not hasattr(paddle.Tensor, 'fill_'):
logger.warn("register user fill_ to paddle.Tensor, remove this when fixed!")
paddle.Tensor.fill_ = fill_
def repeat(xs: paddle.Tensor, *size: Any) -> paddle.Tensor:
return paddle.tile(xs, size)
if not hasattr(paddle.Tensor, 'repeat'):
logger.warn(
"register user repeat to paddle.Tensor, remove this when fixed!")
paddle.Tensor.repeat = repeat
if not hasattr(paddle.Tensor, 'softmax'):
logger.warn(
"register user softmax to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'softmax', paddle.nn.functional.softmax)
if not hasattr(paddle.Tensor, 'sigmoid'):
logger.warn(
"register user sigmoid to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'sigmoid', paddle.nn.functional.sigmoid)
if not hasattr(paddle.Tensor, 'relu'):
logger.warn("register user relu to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'relu', paddle.nn.functional.relu)
def type_as(x: paddle.Tensor, other: paddle.Tensor) -> paddle.Tensor:
return x.astype(other.dtype)
if not hasattr(paddle.Tensor, 'type_as'):
logger.warn(
"register user type_as to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'type_as', type_as)
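# 'to' shim below covers only the single-argument forms x.to(dtype) and
# x.to(other_tensor); a device argument is accepted but treated as a no-op.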
def to(x: paddle.Tensor, *args, **kwargs) -> paddle.Tensor:
assert len(args) == 1
if isinstance(args[0], str): # dtype
return x.astype(args[0])
elif isinstance(args[0], paddle.Tensor): #Tensor
return x.astype(args[0].dtype)
else: # Device
return x
if not hasattr(paddle.Tensor, 'to'):
logger.warn("register user to to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'to', to)
def func_float(x: paddle.Tensor) -> paddle.Tensor:
return x.astype(paddle.float)
if not hasattr(paddle.Tensor, 'float'):
logger.warn("register user float to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'float', func_float)
def tolist(x: paddle.Tensor) -> List[Any]:
return x.numpy().tolist()
if not hasattr(paddle.Tensor, 'tolist'):
logger.warn(
"register user tolist to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'tolist', tolist)
########### hack paddle.nn.functional #############
def glu(x: paddle.Tensor, axis=-1) -> paddle.Tensor:
"""The gated linear unit (GLU) activation."""
a, b = x.split(2, axis=axis)
act_b = F.sigmoid(b)
return a * act_b
if not hasattr(paddle.nn.functional, 'glu'):
logger.warn(
"register user glu to paddle.nn.functional, remove this when fixed!")
setattr(paddle.nn.functional, 'glu', glu)
# def softplus(x):
# """Softplus function."""
# if hasattr(paddle.nn.functional, 'softplus'):
# #return paddle.nn.functional.softplus(x.float()).type_as(x)
# return paddle.nn.functional.softplus(x)
# else:
# raise NotImplementedError
# def gelu_accurate(x):
# """Gaussian Error Linear Units (GELU) activation."""
# # [reference] https://github.com/pytorch/fairseq/blob/e75cff5f2c1d62f12dc911e0bf420025eb1a4e33/fairseq/modules/gelu.py
# if not hasattr(gelu_accurate, "_a"):
# gelu_accurate._a = math.sqrt(2 / math.pi)
# return 0.5 * x * (1 + paddle.tanh(gelu_accurate._a *
# (x + 0.044715 * paddle.pow(x, 3))))
# def gelu(x):
# """Gaussian Error Linear Units (GELU) activation."""
# if hasattr(nn.functional, 'gelu'):
# #return nn.functional.gelu(x.float()).type_as(x)
# return nn.functional.gelu(x)
# else:
# return x * 0.5 * (1.0 + paddle.erf(x / math.sqrt(2.0)))
# hack loss
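# ctc_loss override: call fluid's warpctc directly so norm_by_times can be
# set; with reduction='mean', each utterance loss is divided by its label
# length before averaging over the batch.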
def ctc_loss(logits,
labels,
input_lengths,
label_lengths,
blank=0,
reduction='mean',
norm_by_times=True):
#logger.info("my ctc loss with norm by times")
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times,
input_lengths, label_lengths)
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
assert reduction in ['mean', 'sum', 'none']
if reduction == 'mean':
loss_out = paddle.mean(loss_out / label_lengths)
elif reduction == 'sum':
loss_out = paddle.sum(loss_out)
return loss_out
logger.warn(
"override ctc_loss of paddle.nn.functional if exists, remove this when fixed!"
)
F.ctc_loss = ctc_loss
########### hack paddle.nn #############
if not hasattr(paddle.nn, 'Module'):
logger.warn("register user Module to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'Module', paddle.nn.Layer)
# maybe cause assert isinstance(sublayer, core.Layer)
if not hasattr(paddle.nn, 'ModuleList'):
logger.warn(
"register user ModuleList to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'ModuleList', paddle.nn.LayerList)
class GLU(nn.Layer):
"""Gated Linear Units (GLU) Layer"""
def __init__(self, dim: int=-1):
super().__init__()
self.dim = dim
def forward(self, xs):
return glu(xs, axis=self.dim)
if not hasattr(paddle.nn, 'GLU'):
logger.warn("register user GLU to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'GLU', GLU)
# TODO(Hui Zhang): remove this Layer
class ConstantPad2d(nn.Layer):
"""Pads the input tensor boundaries with a constant value.
For N-dimensional padding, use paddle.nn.functional.pad().
"""
def __init__(self, padding: Union[tuple, list, int], value: float):
"""
Args:
padding ([tuple]): the size of the padding.
If an int, uses the same padding on all boundaries.
If a 4-tuple, uses (padding_left, padding_right, padding_top, padding_bottom)
value ([float]): pad value
"""
super().__init__()
self.padding = padding if isinstance(padding,
(tuple, list)) else [padding] * 4
self.value = value
def forward(self, xs: paddle.Tensor) -> paddle.Tensor:
return nn.functional.pad(
xs,
self.padding,
mode='constant',
value=self.value,
data_format='NCHW')
if not hasattr(paddle.nn, 'ConstantPad2d'):
logger.warn(
"register user ConstantPad2d to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'ConstantPad2d', ConstantPad2d)
########### hack paddle.jit #############
if not hasattr(paddle.jit, 'export'):
logger.warn("register user export to paddle.jit, remove this when fixed!")
setattr(paddle.jit, 'export', paddle.jit.to_static)

@ -12,11 +12,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains various CTC decoders."""
import multiprocessing
from itertools import groupby
import numpy as np
from math import log
import multiprocessing
import numpy as np
def ctc_greedy_decoder(probs_seq, vocabulary):
@ -104,14 +104,14 @@ def ctc_beam_search_decoder(probs_seq,
global ext_nproc_scorer
ext_scoring_func = ext_nproc_scorer
## initialize
# initialize
# prefix_set_prev: the set containing selected prefixes
# probs_b_prev: prefixes' probability ending with blank in previous step
# probs_nb_prev: prefixes' probability ending with non-blank in previous step
prefix_set_prev = {'\t': 1.0}
probs_b_prev, probs_nb_prev = {'\t': 1.0}, {'\t': 0.0}
## extend prefix in loop
# extend prefix in loop
for time_step in range(len(probs_seq)):
# prefix_set_next: the set containing candidate prefixes
# probs_b_cur: prefixes' probability ending with blank in current step
@ -120,7 +120,7 @@ def ctc_beam_search_decoder(probs_seq,
prob_idx = list(enumerate(probs_seq[time_step]))
cutoff_len = len(prob_idx)
#If pruning is enabled
# If pruning is enabled
if cutoff_prob < 1.0 or cutoff_top_n < cutoff_len:
prob_idx = sorted(prob_idx, key=lambda asd: asd[1], reverse=True)
cutoff_len, cum_prob = 0, 0.0
@ -172,7 +172,7 @@ def ctc_beam_search_decoder(probs_seq,
# update probs
probs_b_prev, probs_nb_prev = probs_b_cur, probs_nb_cur
## store top beam_size prefixes
# store top beam_size prefixes
prefix_set_prev = sorted(
prefix_set_next.items(), key=lambda asd: asd[1], reverse=True)
if beam_size < len(prefix_set_prev):
@ -191,7 +191,7 @@ def ctc_beam_search_decoder(probs_seq,
else:
beam_result.append((float('-inf'), ''))
## output top beam_size decoding results
# output top beam_size decoding results
beam_result = sorted(beam_result, key=lambda asd: asd[0], reverse=True)
return beam_result

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""External Scorer for Beam Search Decoder."""
import os
import kenlm
import numpy as np
@ -71,7 +71,7 @@ class Scorer(object):
"""
lm = self._language_model_score(sentence)
word_cnt = self._word_count(sentence)
if log == False:
if log is False:
score = np.power(lm, self._alpha) * np.power(word_cnt, self._beta)
else:
score = self._alpha * np.log(lm) + self._beta * np.log(word_cnt)

@ -36,167 +36,177 @@ std::vector<std::pair<double, std::string>> ctc_beam_search_decoder(
double cutoff_prob,
size_t cutoff_top_n,
Scorer *ext_scorer) {
// dimension check
size_t num_time_steps = probs_seq.size();
for (size_t i = 0; i < num_time_steps; ++i) {
VALID_CHECK_EQ(probs_seq[i].size(),
vocabulary.size() + 1,
"The shape of probs_seq does not match with "
"the shape of the vocabulary");
}
// assign blank id
size_t blank_id = vocabulary.size();
// assign space id
auto it = std::find(vocabulary.begin(), vocabulary.end(), " ");
int space_id = it - vocabulary.begin();
// if no space in vocabulary
if ((size_t)space_id >= vocabulary.size()) {
space_id = -2;
}
// init prefixes' root
PathTrie root;
root.score = root.log_prob_b_prev = 0.0;
std::vector<PathTrie *> prefixes;
prefixes.push_back(&root);
if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
auto fst_dict = static_cast<fst::StdVectorFst *>(ext_scorer->dictionary);
fst::StdVectorFst *dict_ptr = fst_dict->Copy(true);
root.set_dictionary(dict_ptr);
auto matcher = std::make_shared<FSTMATCH>(*dict_ptr, fst::MATCH_INPUT);
root.set_matcher(matcher);
}
// prefix search over time
for (size_t time_step = 0; time_step < num_time_steps; ++time_step) {
auto &prob = probs_seq[time_step];
float min_cutoff = -NUM_FLT_INF;
bool full_beam = false;
if (ext_scorer != nullptr) {
size_t num_prefixes = std::min(prefixes.size(), beam_size);
std::sort(
prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);
min_cutoff = prefixes[num_prefixes - 1]->score +
std::log(prob[blank_id]) - std::max(0.0, ext_scorer->beta);
full_beam = (num_prefixes == beam_size);
// dimension check
size_t num_time_steps = probs_seq.size();
for (size_t i = 0; i < num_time_steps; ++i) {
VALID_CHECK_EQ(probs_seq[i].size(),
// vocabulary.size() + 1,
vocabulary.size(),
"The shape of probs_seq does not match with "
"the shape of the vocabulary");
}
std::vector<std::pair<size_t, float>> log_prob_idx =
get_pruned_log_probs(prob, cutoff_prob, cutoff_top_n);
// loop over chars
for (size_t index = 0; index < log_prob_idx.size(); index++) {
auto c = log_prob_idx[index].first;
auto log_prob_c = log_prob_idx[index].second;
for (size_t i = 0; i < prefixes.size() && i < beam_size; ++i) {
auto prefix = prefixes[i];
if (full_beam && log_prob_c + prefix->score < min_cutoff) {
break;
}
// blank
if (c == blank_id) {
prefix->log_prob_b_cur =
log_sum_exp(prefix->log_prob_b_cur, log_prob_c + prefix->score);
continue;
// assign blank id
// size_t blank_id = vocabulary.size();
size_t blank_id = 0;
// assign space id
auto it = std::find(vocabulary.begin(), vocabulary.end(), " ");
int space_id = it - vocabulary.begin();
// if no space in vocabulary
if ((size_t)space_id >= vocabulary.size()) {
space_id = -2;
}
// init prefixes' root
PathTrie root;
root.score = root.log_prob_b_prev = 0.0;
std::vector<PathTrie *> prefixes;
prefixes.push_back(&root);
if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
auto fst_dict =
static_cast<fst::StdVectorFst *>(ext_scorer->dictionary);
fst::StdVectorFst *dict_ptr = fst_dict->Copy(true);
root.set_dictionary(dict_ptr);
auto matcher = std::make_shared<FSTMATCH>(*dict_ptr, fst::MATCH_INPUT);
root.set_matcher(matcher);
}
// prefix search over time
for (size_t time_step = 0; time_step < num_time_steps; ++time_step) {
auto &prob = probs_seq[time_step];
float min_cutoff = -NUM_FLT_INF;
bool full_beam = false;
if (ext_scorer != nullptr) {
size_t num_prefixes = std::min(prefixes.size(), beam_size);
std::sort(prefixes.begin(),
prefixes.begin() + num_prefixes,
prefix_compare);
min_cutoff = prefixes[num_prefixes - 1]->score +
std::log(prob[blank_id]) -
std::max(0.0, ext_scorer->beta);
full_beam = (num_prefixes == beam_size);
}
// repeated character
if (c == prefix->character) {
prefix->log_prob_nb_cur = log_sum_exp(
prefix->log_prob_nb_cur, log_prob_c + prefix->log_prob_nb_prev);
std::vector<std::pair<size_t, float>> log_prob_idx =
get_pruned_log_probs(prob, cutoff_prob, cutoff_top_n);
// loop over chars
for (size_t index = 0; index < log_prob_idx.size(); index++) {
auto c = log_prob_idx[index].first;
auto log_prob_c = log_prob_idx[index].second;
for (size_t i = 0; i < prefixes.size() && i < beam_size; ++i) {
auto prefix = prefixes[i];
if (full_beam && log_prob_c + prefix->score < min_cutoff) {
break;
}
// blank
if (c == blank_id) {
prefix->log_prob_b_cur = log_sum_exp(
prefix->log_prob_b_cur, log_prob_c + prefix->score);
continue;
}
// repeated character
if (c == prefix->character) {
prefix->log_prob_nb_cur =
log_sum_exp(prefix->log_prob_nb_cur,
log_prob_c + prefix->log_prob_nb_prev);
}
// get new prefix
auto prefix_new = prefix->get_path_trie(c);
if (prefix_new != nullptr) {
float log_p = -NUM_FLT_INF;
if (c == prefix->character &&
prefix->log_prob_b_prev > -NUM_FLT_INF) {
log_p = log_prob_c + prefix->log_prob_b_prev;
} else if (c != prefix->character) {
log_p = log_prob_c + prefix->score;
}
// language model scoring
if (ext_scorer != nullptr &&
(c == space_id || ext_scorer->is_character_based())) {
PathTrie *prefix_to_score = nullptr;
// skip scoring the space
if (ext_scorer->is_character_based()) {
prefix_to_score = prefix_new;
} else {
prefix_to_score = prefix;
}
float score = 0.0;
std::vector<std::string> ngram;
ngram = ext_scorer->make_ngram(prefix_to_score);
score = ext_scorer->get_log_cond_prob(ngram) *
ext_scorer->alpha;
log_p += score;
log_p += ext_scorer->beta;
}
prefix_new->log_prob_nb_cur =
log_sum_exp(prefix_new->log_prob_nb_cur, log_p);
}
} // end of loop over prefix
} // end of loop over vocabulary
prefixes.clear();
// update log probs
root.iterate_to_vec(prefixes);
// only preserve top beam_size prefixes
if (prefixes.size() >= beam_size) {
std::nth_element(prefixes.begin(),
prefixes.begin() + beam_size,
prefixes.end(),
prefix_compare);
for (size_t i = beam_size; i < prefixes.size(); ++i) {
prefixes[i]->remove();
}
}
} // end of loop over time
    // score the last word of each prefix that doesn't end with space
    if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
        for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
            auto prefix = prefixes[i];
            if (!prefix->is_empty() && prefix->character != space_id) {
                float score = 0.0;
                std::vector<std::string> ngram = ext_scorer->make_ngram(prefix);
                score = ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha;
                score += ext_scorer->beta;
                prefix->score += score;
            }
        }
    }
    size_t num_prefixes = std::min(prefixes.size(), beam_size);
    std::sort(prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);
    // compute approximate ctc score as the return score, without affecting
    // the return order of the decoding result. To be deleted once the decoder
    // is stable.
    for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
        double approx_ctc = prefixes[i]->score;
        if (ext_scorer != nullptr) {
            std::vector<int> output;
            prefixes[i]->get_path_vec(output);
            auto prefix_length = output.size();
            auto words = ext_scorer->split_labels(output);
            // remove the word insertion score
            approx_ctc = approx_ctc - prefix_length * ext_scorer->beta;
            // remove the language model weight
            approx_ctc -=
                (ext_scorer->get_sent_log_prob(words)) * ext_scorer->alpha;
        }
        prefixes[i]->approx_ctc = approx_ctc;
    }
    return get_beam_search_result(prefixes, vocabulary, beam_size);
}
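For orientation, a minimal hypothetical driver for the decoder above. The three-frame posteriors, the two-token vocabulary, and the blank/vocabulary column alignment are made up for illustration (the patch assigns blank_id = 0); only the call signature is taken from the batch wrapper below.

// Hypothetical standalone sketch (not part of this patch).
#include <iostream>
#include <string>
#include <utility>
#include <vector>

class Scorer;  // external LM scorer; unused here, so nullptr is passed

// Signature as used by the batch wrapper in the next hunk.
std::vector<std::pair<double, std::string>> ctc_beam_search_decoder(
    const std::vector<std::vector<double>> &probs_seq,
    const std::vector<std::string> &vocabulary,
    size_t beam_size,
    double cutoff_prob,
    size_t cutoff_top_n,
    Scorer *ext_scorer);

int main() {
    // Toy posteriors: rows are time steps; with blank_id = 0 as assigned
    // above, column 0 is blank. The column-to-vocabulary mapping here is
    // illustrative only.
    std::vector<std::vector<double>> probs_seq = {
        {0.1, 0.7, 0.2},
        {0.6, 0.2, 0.2},
        {0.1, 0.1, 0.8},
    };
    std::vector<std::string> vocabulary = {"a", "b"};
    auto results = ctc_beam_search_decoder(
        probs_seq, vocabulary, /*beam_size=*/4,
        /*cutoff_prob=*/1.0, /*cutoff_top_n=*/40, /*ext_scorer=*/nullptr);
    // Pairs are (-approx_ctc, text), best first.
    for (const auto &r : results) {
        std::cout << r.first << " " << r.second << "\n";
    }
    return 0;
}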
@ -209,28 +219,28 @@ ctc_beam_search_decoder_batch(
double cutoff_prob,
size_t cutoff_top_n,
Scorer *ext_scorer) {
    VALID_CHECK_GT(num_processes, 0, "num_processes must be positive!");
// thread pool
ThreadPool pool(num_processes);
// number of samples
size_t batch_size = probs_split.size();
// enqueue the tasks of decoding
std::vector<std::future<std::vector<std::pair<double, std::string>>>> res;
for (size_t i = 0; i < batch_size; ++i) {
res.emplace_back(pool.enqueue(ctc_beam_search_decoder,
probs_split[i],
vocabulary,
beam_size,
cutoff_prob,
cutoff_top_n,
ext_scorer));
}
// get decoding results
std::vector<std::vector<std::pair<double, std::string>>> batch_results;
for (size_t i = 0; i < batch_size; ++i) {
batch_results.emplace_back(res[i].get());
}
return batch_results;
}

@ -18,42 +18,42 @@
std::string ctc_greedy_decoder(
const std::vector<std::vector<double>> &probs_seq,
const std::vector<std::string> &vocabulary) {
// dimension check
size_t num_time_steps = probs_seq.size();
for (size_t i = 0; i < num_time_steps; ++i) {
VALID_CHECK_EQ(probs_seq[i].size(),
vocabulary.size() + 1,
"The shape of probs_seq does not match with "
"the shape of the vocabulary");
}
size_t blank_id = vocabulary.size();
std::vector<size_t> max_idx_vec(num_time_steps, 0);
std::vector<size_t> idx_vec;
for (size_t i = 0; i < num_time_steps; ++i) {
double max_prob = 0.0;
size_t max_idx = 0;
const std::vector<double> &probs_step = probs_seq[i];
for (size_t j = 0; j < probs_step.size(); ++j) {
if (max_prob < probs_step[j]) {
max_idx = j;
max_prob = probs_step[j];
}
}
// id with maximum probability in current time step
max_idx_vec[i] = max_idx;
// deduplicate
if ((i == 0) || ((i > 0) && max_idx_vec[i] != max_idx_vec[i - 1])) {
idx_vec.push_back(max_idx_vec[i]);
}
}
std::string best_path_result;
for (size_t i = 0; i < idx_vec.size(); ++i) {
if (idx_vec[i] != blank_id) {
best_path_result += vocabulary[idx_vec[i]];
}
}
return best_path_result;
}
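A small hypothetical check of the greedy rule above: the per-frame argmax ids 0, 0, 2, 1 first collapse to 0, 2, 1, and dropping the blank (id == vocabulary.size() == 2) leaves "ab".

// Hypothetical check of the greedy decoding rule (not part of this patch).
#include <cassert>
#include <string>
#include <vector>

std::string ctc_greedy_decoder(
    const std::vector<std::vector<double>> &probs_seq,
    const std::vector<std::string> &vocabulary);

int main() {
    // Columns are {"a", "b", blank}; the four rows have argmax ids
    // 0, 0, 2, 1, which collapse to 0, 2, 1 and then to "ab".
    std::vector<std::vector<double>> probs_seq = {
        {0.8, 0.1, 0.1},
        {0.7, 0.2, 0.1},
        {0.1, 0.1, 0.8},
        {0.2, 0.7, 0.1},
    };
    std::vector<std::string> vocabulary = {"a", "b"};
    assert(ctc_greedy_decoder(probs_seq, vocabulary) == "ab");
    return 0;
}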

@ -22,33 +22,35 @@ std::vector<std::pair<size_t, float>> get_pruned_log_probs(
const std::vector<double> &prob_step,
double cutoff_prob,
size_t cutoff_top_n) {
std::vector<std::pair<int, double>> prob_idx;
for (size_t i = 0; i < prob_step.size(); ++i) {
prob_idx.push_back(std::pair<int, double>(i, prob_step[i]));
}
    // pruning of vocabulary
size_t cutoff_len = prob_step.size();
if (cutoff_prob < 1.0 || cutoff_top_n < cutoff_len) {
std::sort(prob_idx.begin(),
prob_idx.end(),
pair_comp_second_rev<int, double>);
if (cutoff_prob < 1.0) {
double cum_prob = 0.0;
cutoff_len = 0;
for (size_t i = 0; i < prob_idx.size(); ++i) {
cum_prob += prob_idx[i].second;
cutoff_len += 1;
if (cum_prob >= cutoff_prob || cutoff_len >= cutoff_top_n)
break;
}
}
prob_idx = std::vector<std::pair<int, double>>(
prob_idx.begin(), prob_idx.begin() + cutoff_len);
}
std::vector<std::pair<size_t, float>> log_prob_idx;
for (size_t i = 0; i < cutoff_len; ++i) {
log_prob_idx.push_back(std::pair<int, float>(
prob_idx[i].first, log(prob_idx[i].second + NUM_FLT_MIN)));
}
return log_prob_idx;
}
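A worked toy case for the pruning rule, assuming decoder_utils.h declares get_pruned_log_probs: with cutoff_prob = 0.9 the descending cumulative mass of {0.5, 0.3, 0.15, 0.05} reaches 0.9 only at the third entry, so three ids survive.

// Hypothetical check of get_pruned_log_probs (not part of this patch).
#include <cassert>
#include <vector>

#include "decoder_utils.h"  // assumed to declare get_pruned_log_probs

int main() {
    std::vector<double> prob_step = {0.5, 0.3, 0.15, 0.05};
    auto pruned = get_pruned_log_probs(prob_step, /*cutoff_prob=*/0.9,
                                       /*cutoff_top_n=*/40);
    // cumulative mass: 0.5 -> 0.8 -> 0.95 (>= 0.9), so 3 entries remain,
    // sorted by descending probability with the original ids preserved.
    assert(pruned.size() == 3);
    assert(pruned[0].first == 0);
    return 0;
}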
@ -56,106 +58,106 @@ std::vector<std::pair<double, std::string>> get_beam_search_result(
const std::vector<PathTrie *> &prefixes,
const std::vector<std::string> &vocabulary,
size_t beam_size) {
// allow for the post processing
std::vector<PathTrie *> space_prefixes;
if (space_prefixes.empty()) {
for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
space_prefixes.push_back(prefixes[i]);
}
}
std::sort(space_prefixes.begin(), space_prefixes.end(), prefix_compare);
std::vector<std::pair<double, std::string>> output_vecs;
for (size_t i = 0; i < beam_size && i < space_prefixes.size(); ++i) {
std::vector<int> output;
space_prefixes[i]->get_path_vec(output);
// convert index to string
std::string output_str;
for (size_t j = 0; j < output.size(); j++) {
output_str += vocabulary[output[j]];
}
std::pair<double, std::string> output_pair(
-space_prefixes[i]->approx_ctc, output_str);
output_vecs.emplace_back(output_pair);
}
return output_vecs;
}
size_t get_utf8_str_len(const std::string &str) {
size_t str_len = 0;
for (char c : str) {
str_len += ((c & 0xc0) != 0x80);
}
return str_len;
}
std::vector<std::string> split_utf8_str(const std::string &str) {
std::vector<std::string> result;
std::string out_str;
for (char c : str) {
if ((c & 0xc0) != 0x80) // new UTF-8 character
{
if (!out_str.empty()) {
result.push_back(out_str);
out_str.clear();
}
}
out_str.append(1, c);
}
result.push_back(out_str);
return result;
}
std::vector<std::string> split_str(const std::string &s,
const std::string &delim) {
std::vector<std::string> result;
std::size_t start = 0, delim_len = delim.size();
while (true) {
std::size_t end = s.find(delim, start);
if (end == std::string::npos) {
if (start < s.size()) {
result.push_back(s.substr(start));
}
break;
}
if (end > start) {
result.push_back(s.substr(start, end - start));
}
start = end + delim_len;
}
return result;
}
bool prefix_compare(const PathTrie *x, const PathTrie *y) {
    if (x->score == y->score) {
        if (x->character == y->character) {
            return false;
        } else {
            return (x->character < y->character);
        }
    } else {
        return x->score > y->score;
    }
}
void add_word_to_fst(const std::vector<int> &word,
fst::StdVectorFst *dictionary) {
if (dictionary->NumStates() == 0) {
fst::StdVectorFst::StateId start = dictionary->AddState();
assert(start == 0);
dictionary->SetStart(start);
}
fst::StdVectorFst::StateId src = dictionary->Start();
fst::StdVectorFst::StateId dst;
for (auto c : word) {
dst = dictionary->AddState();
dictionary->AddArc(src, fst::StdArc(c, c, 0, dst));
src = dst;
}
dictionary->SetFinal(dst, fst::StdArc::Weight::One());
}
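A hypothetical illustration of how add_word_to_fst lays words out: each call appends a fresh chain of states from the shared start state, so common prefixes are only merged later by fst::Determinize in Scorer::fill_dictionary.

// Hypothetical illustration for add_word_to_fst (not part of this patch).
#include <vector>

#include <fst/fstlib.h>

#include "decoder_utils.h"  // assumed to declare add_word_to_fst

int main() {
    fst::StdVectorFst dict;
    // Builds start -(1)-> s1 -(2)-> s2, with s2 marked final.
    add_word_to_fst({1, 2}, &dict);
    // A second word only shares the start state; the common prefix "1"
    // becomes a shared arc only after determinization.
    add_word_to_fst({1, 3}, &dict);
    return 0;
}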
bool add_word_to_dictionary(
@ -164,27 +166,27 @@ bool add_word_to_dictionary(
bool add_space,
int SPACE_ID,
fst::StdVectorFst *dictionary) {
    auto characters = split_utf8_str(word);
    std::vector<int> int_word;
    for (auto &c : characters) {
        if (c == " ") {
            int_word.push_back(SPACE_ID);
        } else {
            auto int_c = char_map.find(c);
            if (int_c != char_map.end()) {
                int_word.push_back(int_c->second);
            } else {
                return false;  // return without adding
            }
        }
    }
    if (add_space) {
        int_word.push_back(SPACE_ID);
    }
    add_word_to_fst(int_word, dictionary);
    return true;  // return with successful adding
}

@ -25,14 +25,14 @@ const float NUM_FLT_MIN = std::numeric_limits<float>::min();
// inline function for validation check
inline void check(
bool x, const char *expr, const char *file, int line, const char *err) {
if (!x) {
std::cout << "[" << file << ":" << line << "] ";
LOG(FATAL) << "\"" << expr << "\" check failed. " << err;
}
}
#define VALID_CHECK(x, info) \
check(static_cast<bool>(x), #x, __FILE__, __LINE__, info)
#define VALID_CHECK_EQ(x, y, info) VALID_CHECK((x) == (y), info)
#define VALID_CHECK_GT(x, y, info) VALID_CHECK((x) > (y), info)
#define VALID_CHECK_LT(x, y, info) VALID_CHECK((x) < (y), info)
@ -42,24 +42,24 @@ inline void check(
template <typename T1, typename T2>
bool pair_comp_first_rev(const std::pair<T1, T2> &a,
const std::pair<T1, T2> &b) {
return a.first > b.first;
}
// Function template for comparing two pairs
template <typename T1, typename T2>
bool pair_comp_second_rev(const std::pair<T1, T2> &a,
const std::pair<T1, T2> &b) {
return a.second > b.second;
}
// Return the sum of two probabilities in log scale
template <typename T>
T log_sum_exp(const T &x, const T &y) {
static T num_min = -std::numeric_limits<T>::max();
if (x <= num_min) return y;
if (y <= num_min) return x;
T xmax = std::max(x, y);
return std::log(std::exp(x - xmax) + std::exp(y - xmax)) + xmax;
}
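A quick numeric sanity check, hypothetical and assuming this header is included directly: since exp(log 0.5) + exp(log 0.25) = 0.75, log_sum_exp(log 0.5, log 0.25) must equal log 0.75 up to rounding.

// Hypothetical numeric check for log_sum_exp (not part of this patch).
#include <cassert>
#include <cmath>

#include "decoder_utils.h"  // assumed to define the log_sum_exp template

int main() {
    double lse = log_sum_exp(std::log(0.5), std::log(0.25));
    assert(std::fabs(lse - std::log(0.75)) < 1e-12);
    return 0;
}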
// Get pruned probability vector for each time step's beam search

@ -23,140 +23,141 @@
#include "decoder_utils.h"
PathTrie::PathTrie() {
log_prob_b_prev = -NUM_FLT_INF;
log_prob_nb_prev = -NUM_FLT_INF;
log_prob_b_cur = -NUM_FLT_INF;
log_prob_nb_cur = -NUM_FLT_INF;
score = -NUM_FLT_INF;
ROOT_ = -1;
character = ROOT_;
exists_ = true;
parent = nullptr;
dictionary_ = nullptr;
dictionary_state_ = 0;
has_dictionary_ = false;
matcher_ = nullptr;
}
PathTrie::~PathTrie() {
for (auto child : children_) {
delete child.second;
}
}
PathTrie* PathTrie::get_path_trie(int new_char, bool reset) {
    auto child = children_.begin();
    for (child = children_.begin(); child != children_.end(); ++child) {
        if (child->first == new_char) {
            break;
        }
    }
    if (child != children_.end()) {
        if (!child->second->exists_) {
            child->second->exists_ = true;
            child->second->log_prob_b_prev = -NUM_FLT_INF;
            child->second->log_prob_nb_prev = -NUM_FLT_INF;
            child->second->log_prob_b_cur = -NUM_FLT_INF;
            child->second->log_prob_nb_cur = -NUM_FLT_INF;
        }
        return (child->second);
    } else {
        if (has_dictionary_) {
            matcher_->SetState(dictionary_state_);
            bool found = matcher_->Find(new_char + 1);
            if (!found) {
                // Adding this character would take the word outside the
                // dictionary
                auto FSTZERO = fst::TropicalWeight::Zero();
                auto final_weight = dictionary_->Final(dictionary_state_);
                bool is_final = (final_weight != FSTZERO);
                if (is_final && reset) {
                    dictionary_state_ = dictionary_->Start();
                }
                return nullptr;
            } else {
                PathTrie* new_path = new PathTrie;
                new_path->character = new_char;
                new_path->parent = this;
                new_path->dictionary_ = dictionary_;
                new_path->dictionary_state_ = matcher_->Value().nextstate;
                new_path->has_dictionary_ = true;
                new_path->matcher_ = matcher_;
                children_.push_back(std::make_pair(new_char, new_path));
                return new_path;
            }
        } else {
            PathTrie* new_path = new PathTrie;
            new_path->character = new_char;
            new_path->parent = this;
            children_.push_back(std::make_pair(new_char, new_path));
            return new_path;
        }
    }
}
PathTrie* PathTrie::get_path_vec(std::vector<int>& output) {
return get_path_vec(output, ROOT_);
}
PathTrie* PathTrie::get_path_vec(std::vector<int>& output,
int stop,
size_t max_steps) {
if (character == stop || character == ROOT_ || output.size() == max_steps) {
std::reverse(output.begin(), output.end());
return this;
} else {
output.push_back(character);
return parent->get_path_vec(output, stop, max_steps);
}
}
void PathTrie::iterate_to_vec(std::vector<PathTrie*>& output) {
    if (exists_) {
        log_prob_b_prev = log_prob_b_cur;
        log_prob_nb_prev = log_prob_nb_cur;
        log_prob_b_cur = -NUM_FLT_INF;
        log_prob_nb_cur = -NUM_FLT_INF;
        score = log_sum_exp(log_prob_b_prev, log_prob_nb_prev);
        output.push_back(this);
    }
    for (auto child : children_) {
        child.second->iterate_to_vec(output);
    }
}
void PathTrie::remove() {
exists_ = false;
if (children_.size() == 0) {
auto child = parent->children_.begin();
for (child = parent->children_.begin();
child != parent->children_.end();
++child) {
if (child->first == character) {
parent->children_.erase(child);
break;
}
}
if (parent->children_.size() == 0 && !parent->exists_) {
parent->remove();
}
delete this;
}
}
void PathTrie::set_dictionary(fst::StdVectorFst* dictionary) {
dictionary_ = dictionary;
dictionary_state_ = dictionary->Start();
has_dictionary_ = true;
}
using FSTMATCH = fst::SortedMatcher<fst::StdVectorFst>;
void PathTrie::set_matcher(std::shared_ptr<FSTMATCH> matcher) {
matcher_ = matcher;
}

@ -27,55 +27,56 @@
* finite-state transducer for spelling correction.
*/
class PathTrie {
  public:
    PathTrie();
    ~PathTrie();

    // get new prefix after appending new char
    PathTrie* get_path_trie(int new_char, bool reset = true);

    // get the prefix in index from root to current node
    PathTrie* get_path_vec(std::vector<int>& output);

    // get the prefix in index from some stop node to current node
    PathTrie* get_path_vec(
        std::vector<int>& output,
        int stop,
        size_t max_steps = std::numeric_limits<size_t>::max());

    // update log probs
    void iterate_to_vec(std::vector<PathTrie*>& output);

    // set dictionary for FST
    void set_dictionary(fst::StdVectorFst* dictionary);

    void set_matcher(std::shared_ptr<fst::SortedMatcher<fst::StdVectorFst>>);

    bool is_empty() { return ROOT_ == character; }

    // remove current path from root
    void remove();

    float log_prob_b_prev;
    float log_prob_nb_prev;
    float log_prob_b_cur;
    float log_prob_nb_cur;
    float score;
    float approx_ctc;
    int character;
    PathTrie* parent;

  private:
    int ROOT_;
    bool exists_;
    bool has_dictionary_;

    std::vector<std::pair<int, PathTrie*>> children_;

    // pointer to dictionary of FST
    fst::StdVectorFst* dictionary_;
    fst::StdVectorFst::StateId dictionary_state_;
    // true if finding arcs in FST
    std::shared_ptr<fst::SortedMatcher<fst::StdVectorFst>> matcher_;
};
#endif // PATH_TRIE_H

@ -31,214 +31,214 @@ Scorer::Scorer(double alpha,
double beta,
const std::string& lm_path,
const std::vector<std::string>& vocab_list) {
this->alpha = alpha;
this->beta = beta;
dictionary = nullptr;
is_character_based_ = true;
language_model_ = nullptr;
max_order_ = 0;
dict_size_ = 0;
SPACE_ID_ = -1;
setup(lm_path, vocab_list);
}
Scorer::~Scorer() {
if (language_model_ != nullptr) {
delete static_cast<lm::base::Model*>(language_model_);
}
if (dictionary != nullptr) {
delete static_cast<fst::StdVectorFst*>(dictionary);
}
}
void Scorer::setup(const std::string& lm_path,
const std::vector<std::string>& vocab_list) {
// load language model
load_lm(lm_path);
// set char map for scorer
set_char_map(vocab_list);
// fill the dictionary for FST
if (!is_character_based()) {
fill_dictionary(true);
}
}
void Scorer::load_lm(const std::string& lm_path) {
const char* filename = lm_path.c_str();
VALID_CHECK_EQ(access(filename, F_OK), 0, "Invalid language model path");
RetriveStrEnumerateVocab enumerate;
lm::ngram::Config config;
config.enumerate_vocab = &enumerate;
language_model_ = lm::ngram::LoadVirtual(filename, config);
max_order_ = static_cast<lm::base::Model*>(language_model_)->Order();
vocabulary_ = enumerate.vocabulary;
for (size_t i = 0; i < vocabulary_.size(); ++i) {
if (is_character_based_ && vocabulary_[i] != UNK_TOKEN &&
vocabulary_[i] != START_TOKEN && vocabulary_[i] != END_TOKEN &&
get_utf8_str_len(enumerate.vocabulary[i]) > 1) {
is_character_based_ = false;
}
}
}
double Scorer::get_log_cond_prob(const std::vector<std::string>& words) {
lm::base::Model* model = static_cast<lm::base::Model*>(language_model_);
double cond_prob;
lm::ngram::State state, tmp_state, out_state;
    // avoid inserting <s> at the beginning
model->NullContextWrite(&state);
for (size_t i = 0; i < words.size(); ++i) {
lm::WordIndex word_index = model->BaseVocabulary().Index(words[i]);
// encounter OOV
if (word_index == 0) {
return OOV_SCORE;
}
cond_prob = model->BaseScore(&state, word_index, &out_state);
tmp_state = state;
state = out_state;
out_state = tmp_state;
}
// return log10 prob
return cond_prob;
}
double Scorer::get_sent_log_prob(const std::vector<std::string>& words) {
std::vector<std::string> sentence;
if (words.size() == 0) {
for (size_t i = 0; i < max_order_; ++i) {
sentence.push_back(START_TOKEN);
}
} else {
for (size_t i = 0; i < max_order_ - 1; ++i) {
sentence.push_back(START_TOKEN);
}
sentence.insert(sentence.end(), words.begin(), words.end());
}
sentence.push_back(END_TOKEN);
return get_log_prob(sentence);
}
double Scorer::get_log_prob(const std::vector<std::string>& words) {
assert(words.size() > max_order_);
double score = 0.0;
for (size_t i = 0; i < words.size() - max_order_ + 1; ++i) {
std::vector<std::string> ngram(words.begin() + i,
words.begin() + i + max_order_);
score += get_log_cond_prob(ngram);
}
return score;
}
void Scorer::reset_params(float alpha, float beta) {
this->alpha = alpha;
this->beta = beta;
}
std::string Scorer::vec2str(const std::vector<int>& input) {
std::string word;
for (auto ind : input) {
word += char_list_[ind];
}
return word;
}
std::vector<std::string> Scorer::split_labels(const std::vector<int>& labels) {
if (labels.empty()) return {};
std::string s = vec2str(labels);
std::vector<std::string> words;
if (is_character_based_) {
words = split_utf8_str(s);
} else {
words = split_str(s, " ");
}
return words;
}
void Scorer::set_char_map(const std::vector<std::string>& char_list) {
char_list_ = char_list;
char_map_.clear();
// Set the char map for the FST for spelling correction
for (size_t i = 0; i < char_list_.size(); i++) {
if (char_list_[i] == " ") {
SPACE_ID_ = i;
}
// The initial state of FST is state 0, hence the index of chars in
// the FST should start from 1 to avoid the conflict with the initial
// state, otherwise wrong decoding results would be given.
char_map_[char_list_[i]] = i + 1;
}
}
std::vector<std::string> Scorer::make_ngram(PathTrie* prefix) {
std::vector<std::string> ngram;
PathTrie* current_node = prefix;
PathTrie* new_node = nullptr;
for (int order = 0; order < max_order_; order++) {
std::vector<int> prefix_vec;
if (is_character_based_) {
new_node = current_node->get_path_vec(prefix_vec, SPACE_ID_, 1);
current_node = new_node;
} else {
new_node = current_node->get_path_vec(prefix_vec, SPACE_ID_);
current_node = new_node->parent; // Skipping spaces
}
// reconstruct word
std::string word = vec2str(prefix_vec);
ngram.push_back(word);
if (new_node->character == -1) {
// No more spaces, but still need order
for (int i = 0; i < max_order_ - order - 1; i++) {
ngram.push_back(START_TOKEN);
}
break;
}
}
std::reverse(ngram.begin(), ngram.end());
return ngram;
}
void Scorer::fill_dictionary(bool add_space) {
fst::StdVectorFst dictionary;
// For each unigram convert to ints and put in trie
int dict_size = 0;
for (const auto& word : vocabulary_) {
bool added = add_word_to_dictionary(
word, char_map_, add_space, SPACE_ID_ + 1, &dictionary);
dict_size += added ? 1 : 0;
}
dict_size_ = dict_size;
/* Simplify FST
* This gets rid of "epsilon" transitions in the FST.
* These are transitions that don't require a string input to be taken.
     * Getting rid of them is necessary to make the FST deterministic, but
* can greatly increase the size of the FST
*/
fst::RmEpsilon(&dictionary);
fst::StdVectorFst* new_dict = new fst::StdVectorFst;
/* This makes the FST deterministic, meaning for any string input there's
* only one possible state the FST could be in. It is assumed our
* dictionary is deterministic when using it.
* (lest we'd have to check for multiple transitions at each state)
*/
fst::Determinize(dictionary, new_dict);
/* Finds the simplest equivalent fst. This is unnecessary but decreases
* memory usage of the dictionary
*/
fst::Minimize(new_dict);
this->dictionary = new_dict;
}

@ -34,14 +34,14 @@ const std::string END_TOKEN = "</s>";
// Implement a callback to retrieve the dictionary of language model.
class RetriveStrEnumerateVocab : public lm::EnumerateVocab {
public:
RetriveStrEnumerateVocab() {}
void Add(lm::WordIndex index, const StringPiece &str) {
vocabulary.push_back(std::string(str.data(), str.length()));
}
std::vector<std::string> vocabulary;
};
/* External scorer to query score for n-gram or sentence, including language
@ -53,74 +53,74 @@ public:
* scorer.get_sent_log_prob({ "WORD1", "WORD2", "WORD3" });
*/
class Scorer {
  public:
    Scorer(double alpha,
           double beta,
           const std::string &lm_path,
           const std::vector<std::string> &vocabulary);
    ~Scorer();

    double get_log_cond_prob(const std::vector<std::string> &words);

    double get_sent_log_prob(const std::vector<std::string> &words);

    // return the max order
    size_t get_max_order() const { return max_order_; }

    // return the dictionary size of language model
    size_t get_dict_size() const { return dict_size_; }

    // return true if the language model is character based
    bool is_character_based() const { return is_character_based_; }

    // reset params alpha & beta
    void reset_params(float alpha, float beta);

    // make ngram for a given prefix
    std::vector<std::string> make_ngram(PathTrie *prefix);

    // transform the labels in index to the vector of words (word based lm)
    // or the vector of characters (character based lm)
    std::vector<std::string> split_labels(const std::vector<int> &labels);

    // language model weight
    double alpha;
    // word insertion weight
    double beta;

    // pointer to the dictionary of FST
    void *dictionary;

  protected:
    // necessary setup: load language model, set char map, fill FST's
    // dictionary
    void setup(const std::string &lm_path,
               const std::vector<std::string> &vocab_list);

    // load language model from given path
    void load_lm(const std::string &lm_path);

    // fill dictionary for FST
    void fill_dictionary(bool add_space);

    // set char map
    void set_char_map(const std::vector<std::string> &char_list);

    double get_log_prob(const std::vector<std::string> &words);

    // translate the vector in index to string
    std::string vec2str(const std::vector<int> &input);

  private:
    void *language_model_;
    bool is_character_based_;
    size_t max_order_;
    size_t dict_size_;

    int SPACE_ID_;
    std::vector<std::string> char_list_;
    std::unordered_map<std::string, int> char_map_;

    std::vector<std::string> vocabulary_;
};
#endif // SCORER_H_

@ -12,13 +12,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Script to build and install decoder package."""
import argparse
import glob
import multiprocessing.pool
import os
import platform
import sys

from setuptools import distutils
from setuptools import Extension
from setuptools import setup
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
@ -65,9 +68,9 @@ def parallelCCompile(self,
def compile_test(header, library):
dummy_path = os.path.join(os.path.dirname(__file__), "dummy")
command = "bash -c \"g++ -include " + header \
+ " -l" + library + " -x c++ - <<<'int main() {}' -o " \
+ dummy_path + " >/dev/null 2>/dev/null && rm " \
+ dummy_path + " 2>/dev/null\""
+ " -l" + library + " -x c++ - <<<'int main() {}' -o " \
+ dummy_path + " >/dev/null 2>/dev/null && rm " \
+ dummy_path + " 2>/dev/null\""
return os.system(command) == 0
@ -75,8 +78,8 @@ def compile_test(header, library):
distutils.ccompiler.CCompiler.compile = parallelCCompile
FILES = glob.glob('kenlm/util/*.cc') \
+ glob.glob('kenlm/lm/*.cc') \
+ glob.glob('kenlm/util/double-conversion/*.cc')
FILES += glob.glob('openfst-1.6.3/src/lib/*.cc')

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Wrapper for various CTC decoders in SWIG."""
import swig_decoders

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Test decoders."""
import unittest
from deepspeech.decoders import decoders_deprecated as decoder

@ -12,11 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Client-end for the ASR demo."""
import argparse
import sys
import keyboard
import pyaudio
from deepspeech.utils.socket_server import socket_send
@ -49,7 +48,7 @@ def on_press_release(x):
sys.stdout.flush()
is_recording = True
if x.event_type == 'up' and x.name == release.name:
if is_recording:
is_recording = False

@ -13,9 +13,10 @@
# limitations under the License.
"""Record wav from Microphone"""
# http://people.csail.mit.edu/hubert/pyaudio/
import wave
import pyaudio
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1

@ -12,28 +12,22 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Server-end for the ASR demo."""
import argparse
import functools
import os
import time

import numpy as np
import paddle
from paddle.inference import Config
from paddle.inference import create_predictor

from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.frontend.utility import read_manifest
from deepspeech.io.dataset import ManifestDataset
from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.socket_server import AsrRequestHandler
from deepspeech.utils.socket_server import AsrTCPServer
from deepspeech.utils.socket_server import warm_up_test
from deepspeech.utils.utility import add_arguments
from deepspeech.utils.utility import print_arguments
def init_predictor(args):
@ -83,23 +77,11 @@ def inference(config, args):
def start_server(config, args):
"""Start the ASR server"""
config.defrost()
    config.data.manifest = config.data.test_manifest
config.data.augmentation_config = ""
config.data.keep_transcription_text = True
dataset = ManifestDataset.from_config(config)
model = DeepSpeech2Model.from_pretrained(dataset, config,
args.checkpoint_path)
@ -171,22 +153,20 @@ if __name__ == "__main__":
"--params_file",
type=str,
default="",
help="Parameter filename, Specify this when your model is a combined model."
)
add_arg(
"--model_dir",
type=str,
default=None,
help="Model dir, If you load a non-combined model, specify the directory of the model."
)
add_arg("--use_gpu",
type=bool,
default=False,
help="Whether use gpu.")
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
@ -198,7 +178,7 @@ if __name__ == "__main__":
print(config)
args.warmup_manifest = config.data.test_manifest
print_arguments(args, globals())
if args.dump_config:
with open(args.dump_config, 'w') as f:

@ -12,8 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Socket client to send wav to ASR server."""
import struct
import socket
import argparse
import wave

@ -12,46 +12,30 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Server-end for the ASR demo."""
import argparse
import functools
import os
import time

import numpy as np
import paddle

from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.frontend.utility import read_manifest
from deepspeech.io.dataset import ManifestDataset
from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.socket_server import AsrRequestHandler
from deepspeech.utils.socket_server import AsrTCPServer
from deepspeech.utils.socket_server import warm_up_test
from deepspeech.utils.utility import add_arguments
from deepspeech.utils.utility import print_arguments
def start_server(config, args):
"""Start the ASR server"""
config.defrost()
    config.data.manifest = config.data.test_manifest
config.data.augmentation_config = ""
config.data.keep_transcription_text = True
dataset = ManifestDataset.from_config(config)
model = DeepSpeech2Model.from_pretrained(dataset, config,
args.checkpoint_path)
model.eval()
@ -111,9 +95,9 @@ if __name__ == "__main__":
add_arg('speech_save_dir', str,
'demo_cache',
"Directory to save demo audios.")
add_arg('warmup_manifest', str, None, "Filepath of manifest to warm up.")
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
@ -125,7 +109,7 @@ if __name__ == "__main__":
print(config)
args.warmup_manifest = config.data.test_manifest
print_arguments(args, globals())
if args.dump_config:
with open(args.dump_config, 'w') as f:

@ -12,20 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Export for DeepSpeech2 model."""
from paddle import distributed as dist
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):

@ -12,20 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evaluation for DeepSpeech2 model."""
from paddle import distributed as dist
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):
@ -41,7 +31,7 @@ def main(config, args):
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -12,19 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Trainer for DeepSpeech2 model."""
from paddle import distributed as dist
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.exps.deepspeech2.model import DeepSpeech2Trainer as Trainer
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):
@ -43,7 +36,7 @@ def main(config, args):
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -12,26 +12,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Beam search parameters tuning for DeepSpeech2 model."""
import argparse
import functools
import sys

import numpy as np
from paddle.io import DataLoader

from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.io.collator import SpeechCollator
from deepspeech.io.dataset import ManifestDataset
from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils import error_rate
from deepspeech.utils.utility import add_arguments
from deepspeech.utils.utility import print_arguments
def tune(config, args):
@ -40,31 +34,18 @@ def tune(config, args):
raise ValueError("num_alphas must be non-negative!")
if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!")
config.defrost()
    config.data.manifest = config.data.dev_manifest
config.data.augmentation_config = ""
config.data.keep_transcription_text = True
dev_dataset = ManifestDataset.from_config(config)
valid_loader = DataLoader(
dev_dataset,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=SpeechCollator(is_training=False))
collate_fn=SpeechCollator(keep_transcription_text=True))
model = DeepSpeech2Model.from_pretrained(dev_dataset, config,
args.checkpoint_path)
@ -103,13 +84,13 @@ def tune(config, args):
trans.append(''.join([chr(i) for i in ids]))
return trans
audio, audio_len, text, text_len = infer_data
target_transcripts = ordid2token(text, text_len)
num_ins += audio.shape[0]
# model infer
eouts, eouts_len = model.encoder(audio, audio_len)
probs = model.decoder.softmax(eouts)
# grid search
for index, (alpha, beta) in enumerate(params_grid):
@ -134,7 +115,7 @@ def tune(config, args):
if index % 2 == 0:
sys.stdout.write('.')
sys.stdout.flush()
print(f"tuneing: one grid done!")
print("tuneing: one grid done!")
# output on-line tuning result at the end of current batch
err_ave_min = min(err_ave)
@ -185,7 +166,7 @@ if __name__ == "__main__":
add_arg('cutoff_top_n', int, 40, "Cutoff number for pruning.")
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -11,8 +11,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
from deepspeech.models.deepspeech2 import DeepSpeech2Model
_C = CN()
@ -21,7 +21,9 @@ _C.data = CN(
train_manifest="",
dev_manifest="",
test_manifest="",
unit_type="char",
vocab_filepath="",
spm_model_prefix="",
mean_std_filepath="",
augmentation_config="",
max_duration=float('inf'),
@ -30,8 +32,10 @@ _C.data = CN(
window_ms=20.0, # ms
n_fft=None, # fft points
max_freq=None, # None for samplerate/2
specgram_type='linear', # 'linear', 'mfcc', 'fbank'
feat_dim=0, # 'mfcc', 'fbank'
        delta_delta=False,  # 'mfcc', 'fbank'
target_sample_rate=16000, # target sample rate
use_dB_normalization=True,
target_dB=-20,
random_seed=0,
@ -81,4 +85,6 @@ def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
config = _C.clone()
config.set_new_allowed(True)
return config

@ -12,46 +12,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains DeepSpeech2 model."""
import time
from collections import defaultdict
from functools import partial
from pathlib import Path

import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader

from deepspeech.io.collator import SpeechCollator
from deepspeech.io.dataset import ManifestDataset
from deepspeech.io.sampler import SortagradBatchSampler
from deepspeech.io.sampler import SortagradDistributedBatchSampler
from deepspeech.models.deepspeech2 import DeepSpeech2InferModel
from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.training.gradclip import ClipGradByGlobalNormWithLog
from deepspeech.training.trainer import Trainer
from deepspeech.utils import error_rate
from deepspeech.utils import layer_tools
from deepspeech.utils import mp_tools
from deepspeech.utils.log import Log

logger = Log(__name__).getlog()
class DeepSpeech2Trainer(Trainer):
def __init__(self, config, args):
super().__init__(config, args)
def train_batch(self, batch_data):
def train_batch(self, batch_index, batch_data, msg):
start = time.time()
self.model.train()
loss = self.model(*batch_data)
loss.backward()
layer_tools.print_grads(self.model, print_func=None)
@ -63,46 +55,49 @@ class DeepSpeech2Trainer(Trainer):
losses_np = {
'train_loss': float(loss),
}
msg = "Train: Rank: {}, ".format(dist.get_rank())
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "time: {:>.3f}s, ".format(iteration_time)
msg += "train time: {:>.3f}s, ".format(iteration_time)
msg += "batch size: {}, ".format(self.config.data.batch_size)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
self.logger.info(msg)
logger.info(msg)
if dist.get_rank() == 0 and self.visualizer:
for k, v in losses_np.items():
self.visualizer.add_scalar("train/{}".format(k), v,
self.iteration)
self.iteration += 1
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
self.logger.info(
f"Valid Total Examples: {len(self.valid_loader.dataset)}")
logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}")
self.model.eval()
valid_losses = defaultdict(list)
num_seen_utts = 1
total_loss = 0.0
for i, batch in enumerate(self.valid_loader):
loss = self.model(*batch)
valid_losses['val_loss'].append(float(loss))
# write visual log
valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
# logging
msg = f"Valid: Rank: {dist.get_rank()}, "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_losses.items())
self.logger.info(msg)
if self.visualizer:
for k, v in valid_losses.items():
self.visualizer.add_scalar("valid/{}".format(k), v,
self.iteration)
if paddle.isfinite(loss):
num_utts = batch[0].shape[0]
num_seen_utts += num_utts
total_loss += float(loss) * num_utts
valid_losses['val_loss'].append(float(loss))
if (i + 1) % self.config.training.log_interval == 0:
valid_dump = {k: np.mean(v) for k, v in valid_losses.items()}
valid_dump['val_history_loss'] = total_loss / num_seen_utts
# logging
msg = f"Valid: Rank: {dist.get_rank()}, "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "batch : {}/{}, ".format(i + 1, len(self.valid_loader))
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_dump.items())
logger.info(msg)
logger.info('Rank {} Val info val_loss {}'.format(
dist.get_rank(), total_loss / num_seen_utts))
return total_loss, num_seen_utts
def setup_model(self):
config = self.config
@ -118,9 +113,11 @@ class DeepSpeech2Trainer(Trainer):
if self.parallel:
model = paddle.DataParallel(model)
layer_tools.print_params(model, self.logger.info)
logger.info(f"{model}")
layer_tools.print_params(model, logger.info)
grad_clip = MyClipGradByGlobalNorm(config.training.global_grad_clip)
grad_clip = ClipGradByGlobalNormWithLog(
config.training.global_grad_clip)
lr_scheduler = paddle.optimizer.lr.ExponentialDecay(
learning_rate=config.training.lr,
gamma=config.training.lr_decay,
@ -135,48 +132,19 @@ class DeepSpeech2Trainer(Trainer):
self.model = model
self.optimizer = optimizer
self.lr_scheduler = lr_scheduler
self.logger.info("Setup model/optimizer/lr_scheduler!")
logger.info("Setup model/optimizer/lr_scheduler!")
def setup_dataloader(self):
config = self.config
config = self.config.clone()
config.defrost()
config.data.keep_transcription_text = False
config.data.manifest = config.data.train_manifest
train_dataset = ManifestDataset.from_config(config)
train_dataset = ManifestDataset(
config.data.train_manifest,
config.data.vocab_filepath,
config.data.mean_std_filepath,
augmentation_config=io.open(
config.data.augmentation_config, mode='r',
encoding='utf8').read(),
max_duration=config.data.max_duration,
min_duration=config.data.min_duration,
stride_ms=config.data.stride_ms,
window_ms=config.data.window_ms,
n_fft=config.data.n_fft,
max_freq=config.data.max_freq,
target_sample_rate=config.data.target_sample_rate,
specgram_type=config.data.specgram_type,
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=False)
dev_dataset = ManifestDataset(
config.data.dev_manifest,
config.data.vocab_filepath,
config.data.mean_std_filepath,
augmentation_config="{}",
max_duration=config.data.max_duration,
min_duration=config.data.min_duration,
stride_ms=config.data.stride_ms,
window_ms=config.data.window_ms,
n_fft=config.data.n_fft,
max_freq=config.data.max_freq,
target_sample_rate=config.data.target_sample_rate,
specgram_type=config.data.specgram_type,
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=False)
config.data.manifest = config.data.dev_manifest
config.data.augmentation_config = ""
dev_dataset = ManifestDataset.from_config(config)
if self.parallel:
batch_sampler = SortagradDistributedBatchSampler(
@ -197,7 +165,7 @@ class DeepSpeech2Trainer(Trainer):
sortagrad=config.data.sortagrad,
shuffle_method=config.data.shuffle_method)
collate_fn = SpeechCollator(is_training=True)
collate_fn = SpeechCollator(keep_transcription_text=False)
self.train_loader = DataLoader(
train_dataset,
batch_sampler=batch_sampler,
@ -209,7 +177,7 @@ class DeepSpeech2Trainer(Trainer):
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
self.logger.info("Setup train/valid Dataloader!")
logger.info("Setup train/valid Dataloader!")
class DeepSpeech2Tester(DeepSpeech2Trainer):
@ -225,7 +193,7 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
trans.append(''.join([chr(i) for i in ids]))
return trans
def compute_metrics(self, audio, texts, audio_len, texts_len):
def compute_metrics(self, audio, audio_len, texts, texts_len):
cfg = self.config.decoding
errors_sum, len_refs, num_ins = 0.0, 0, 0
errors_func = error_rate.char_errors if cfg.error_rate_type == 'cer' else error_rate.word_errors
@ -252,11 +220,10 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
errors_sum += errors
len_refs += len_ref
num_ins += 1
self.logger.info(
"\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
self.logger.info("Current error rate [%s] = %f" % (
cfg.error_rate_type, error_rate_func(target, result)))
logger.info("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
logger.info("Current error rate [%s] = %f" %
(cfg.error_rate_type, error_rate_func(target, result)))
return dict(
errors_sum=errors_sum,
@ -268,8 +235,7 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
@mp_tools.rank_zero_only
@paddle.no_grad()
def test(self):
self.logger.info(
f"Test Total Examples: {len(self.test_loader.dataset)}")
logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}")
self.model.eval()
cfg = self.config
error_rate_type = None
@ -281,19 +247,19 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
len_refs += metrics['len_refs']
num_ins += metrics['num_ins']
error_rate_type = metrics['error_rate_type']
self.logger.info("Error rate [%s] (%d/?) = %f" %
(error_rate_type, num_ins, errors_sum / len_refs))
logger.info("Error rate [%s] (%d/?) = %f" %
(error_rate_type, num_ins, errors_sum / len_refs))
# logging
msg = "Test: "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += ", Final error rate [%s] (%d/%d) = %f" % (
msg += "Final error rate [%s] (%d/%d) = %f" % (
error_rate_type, num_ins, num_ins, errors_sum / len_refs)
self.logger.info(msg)
logger.info(msg)
def run_test(self):
self.resume_or_load()
self.resume_or_scratch()
try:
self.test()
except KeyboardInterrupt:
@ -329,7 +295,6 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
self.setup_output_dir()
self.setup_checkpointer()
self.setup_logger()
self.setup_dataloader()
self.setup_model()
@ -348,28 +313,25 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
use_gru=config.model.use_gru,
share_rnn_weights=config.model.share_rnn_weights)
self.model = model
self.logger.info("Setup model!")
logger.info("Setup model!")
def setup_dataloader(self):
config = self.config
config = self.config.clone()
config.defrost()
# return raw text
test_dataset = ManifestDataset(
config.data.test_manifest,
config.data.vocab_filepath,
config.data.mean_std_filepath,
augmentation_config="{}",
max_duration=config.data.max_duration,
min_duration=config.data.min_duration,
stride_ms=config.data.stride_ms,
window_ms=config.data.window_ms,
n_fft=config.data.n_fft,
max_freq=config.data.max_freq,
target_sample_rate=config.data.target_sample_rate,
specgram_type=config.data.specgram_type,
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=True)
config.data.manifest = config.data.test_manifest
config.data.keep_transcription_text = True
config.data.augmentation_config = ""
# filter test examples; this yields fewer examples but avoids a mismatch with training
# and allows a large batch size to save time, so filter the test egs for now.
# config.data.min_input_len = 0.0 # second
# config.data.max_input_len = float('inf') # second
# config.data.min_output_len = 0.0 # tokens
# config.data.max_output_len = float('inf') # tokens
# config.data.min_output_input_ratio = 0.00
# config.data.max_output_input_ratio = float('inf')
test_dataset = ManifestDataset.from_config(config)
# return text ord id
self.test_loader = DataLoader(
@ -377,8 +339,8 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
batch_size=config.decoding.batch_size,
shuffle=False,
drop_last=False,
collate_fn=SpeechCollator(is_training=False))
self.logger.info("Setup test Dataloader!")
collate_fn=SpeechCollator(keep_transcription_text=True))
logger.info("Setup test Dataloader!")
def setup_output_dir(self):
"""Create a directory used for output.
@ -393,25 +355,3 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
output_dir.mkdir(parents=True, exist_ok=True)
self.output_dir = output_dir
def setup_logger(self):
"""Initialize a text logger to log the experiment.
Each process has its own text logger. The logging messages are written to
the standard output and to a text file named ``worker_n.log`` in the
output directory, where ``n`` is the rank of the process.
"""
format = '[%(levelname)s %(asctime)s %(filename)s:%(lineno)d] %(message)s'
formatter = logging.Formatter(fmt=format, datefmt='%Y/%m/%d %H:%M:%S')
logger.setLevel("INFO")
# global logger
stdout = True
save_path = ""
logging.basicConfig(
level=logging.DEBUG if stdout else logging.INFO,
format=format,
datefmt='%Y/%m/%d %H:%M:%S',
filename=save_path if not stdout else None)
self.logger = logger

@ -0,0 +1,13 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,48 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Export for U2 model."""
from deepspeech.exps.u2.config import get_cfg_defaults
from deepspeech.exps.u2.model import U2Tester as Tester
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):
exp = Tester(config, args)
exp.setup()
exp.run_export()
def main(config, args):
main_sp(config, args)
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
if args.dump_config:
with open(args.dump_config, 'w') as f:
print(config, file=f)
main(config, args)

@ -11,22 +11,15 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Inferer for DeepSpeech2 model."""
import io
import logging
import argparse
import functools
from paddle import distributed as dist
"""Evaluation for U2 model."""
import cProfile
from deepspeech.exps.u2.config import get_cfg_defaults
from deepspeech.exps.u2.model import U2Tester as Tester
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
from deepspeech.utils.error_rate import char_errors, word_errors
# TODO(hui zhang): dynamic load
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
def main_sp(config, args):
@ -42,7 +35,7 @@ def main(config, args):
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args)
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
@ -56,4 +49,7 @@ if __name__ == "__main__":
with open(args.dump_config, 'w') as f:
print(config, file=f)
main(config, args)
# Setting for profiling
pr = cProfile.Profile()
pr.runcall(main, config, args)
pr.dump_stats('test.profile')

@ -0,0 +1,59 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Trainer for U2 model."""
import cProfile
import os
from paddle import distributed as dist
from deepspeech.exps.u2.config import get_cfg_defaults
from deepspeech.exps.u2.model import U2Trainer as Trainer
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):
exp = Trainer(config, args)
exp.setup()
exp.run()
def main(config, args):
if args.device == "gpu" and args.nprocs > 1:
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
else:
main_sp(config, args)
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
if args.dump_config:
with open(args.dump_config, 'w') as f:
print(config, file=f)
# Setting for profiling
pr = cProfile.Profile()
pr.runcall(main, config, args)
pr.dump_stats(os.path.join(args.output, 'train.profile'))

@ -0,0 +1,38 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode
from deepspeech.exps.u2.model import U2Tester
from deepspeech.exps.u2.model import U2Trainer
from deepspeech.io.dataset import ManifestDataset
from deepspeech.models.u2 import U2Model
_C = CfgNode()
_C.data = ManifestDataset.params()
_C.model = U2Model.params()
_C.training = U2Trainer.params()
_C.decoding = U2Tester.params()
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
config = _C.clone()
config.set_new_allowed(True)
return config

@ -0,0 +1,545 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains U2 model."""
import json
import os
import sys
import time
from collections import defaultdict
from pathlib import Path
from typing import Optional
import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
from yacs.config import CfgNode
from deepspeech.io.collator import SpeechCollator
from deepspeech.io.dataset import ManifestDataset
from deepspeech.io.sampler import SortagradBatchSampler
from deepspeech.io.sampler import SortagradDistributedBatchSampler
from deepspeech.models.u2 import U2Model
from deepspeech.training.gradclip import ClipGradByGlobalNormWithLog
from deepspeech.training.scheduler import WarmupLR
from deepspeech.training.trainer import Trainer
from deepspeech.utils import error_rate
from deepspeech.utils import layer_tools
from deepspeech.utils import mp_tools
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
class U2Trainer(Trainer):
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
# training config
default = CfgNode(
dict(
n_epoch=50, # train epochs
log_interval=100, # steps
accum_grad=1, # accum grad by # steps
global_grad_clip=5.0, # the global norm clip
))
default.optim = 'adam'
default.optim_conf = CfgNode(
dict(
lr=5e-4, # learning rate
weight_decay=1e-6, # the coeff of weight decay
))
default.scheduler = 'warmuplr'
default.scheduler_conf = CfgNode(
dict(
warmup_steps=25000,
lr_decay=1.0, # learning rate decay
))
if config is not None:
config.merge_from_other_cfg(default)
return default
def __init__(self, config, args):
super().__init__(config, args)
def train_batch(self, batch_index, batch_data, msg):
train_conf = self.config.training
start = time.time()
loss, attention_loss, ctc_loss = self.model(*batch_data)
# loss div by `batch_size * accum_grad`
loss /= train_conf.accum_grad
loss.backward()
layer_tools.print_grads(self.model, print_func=None)
losses_np = {'loss': float(loss) * train_conf.accum_grad}
if attention_loss:
losses_np['att_loss'] = float(attention_loss)
if ctc_loss:
losses_np['ctc_loss'] = float(ctc_loss)
if (batch_index + 1) % train_conf.accum_grad == 0:
self.optimizer.step()
self.optimizer.clear_grad()
self.lr_scheduler.step()
self.iteration += 1
iteration_time = time.time() - start
if (batch_index + 1) % train_conf.log_interval == 0:
msg += "train time: {:>.3f}s, ".format(iteration_time)
msg += "batch size: {}, ".format(self.config.data.batch_size)
msg += "accum: {}, ".format(train_conf.accum_grad)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
logger.info(msg)
if dist.get_rank() == 0 and self.visualizer:
losses_np_v = losses_np.copy()
losses_np_v.update({"lr": self.lr_scheduler()})
self.visualizer.add_scalars("step", losses_np_v,
self.iteration - 1)
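# A self-contained numpy sketch (not the repo's code) of why train_batch
# divides the loss by accum_grad: stepping once per accum_grad equal-sized
# micro-batches with the scaled loss reproduces the one-big-batch gradient.
import numpy as np
rng = np.random.RandomState(0)
X, y, w = rng.randn(8, 3), rng.randn(8), rng.randn(3)
def grad(Xb, yb, w):
    # gradient of the mean squared error 0.5 * mean((Xb @ w - yb) ** 2)
    return Xb.T @ (Xb @ w - yb) / len(yb)
accum_grad = 4
micro = np.array_split(np.arange(8), accum_grad)
g_accum = sum(grad(X[idx], y[idx], w) / accum_grad for idx in micro)
assert np.allclose(g_accum, grad(X, y, w))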
@paddle.no_grad()
def valid(self):
self.model.eval()
logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}")
valid_losses = defaultdict(list)
num_seen_utts = 1
total_loss = 0.0
for i, batch in enumerate(self.valid_loader):
loss, attention_loss, ctc_loss = self.model(*batch)
if paddle.isfinite(loss):
num_utts = batch[0].shape[0]
num_seen_utts += num_utts
total_loss += float(loss) * num_utts
valid_losses['val_loss'].append(float(loss))
if attention_loss:
valid_losses['val_att_loss'].append(float(attention_loss))
if ctc_loss:
valid_losses['val_ctc_loss'].append(float(ctc_loss))
if (i + 1) % self.config.training.log_interval == 0:
valid_dump = {k: np.mean(v) for k, v in valid_losses.items()}
valid_dump['val_history_loss'] = total_loss / num_seen_utts
# logging
msg = f"Valid: Rank: {dist.get_rank()}, "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader))
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_dump.items())
logger.info(msg)
logger.info('Rank {} Val info val_loss {}'.format(
dist.get_rank(), total_loss / num_seen_utts))
return total_loss, num_seen_utts
def train(self):
"""The training process control by step."""
# !!!IMPORTANT!!!
# Try to export the model via paddle.jit script; if that fails, refine
# the code to satisfy the script-export requirements.
# script_model = paddle.jit.to_static(self.model)
# script_model_path = str(self.checkpoint_dir / 'init')
# paddle.jit.save(script_model, script_model_path)
from_scratch = self.resume_or_scratch()
if from_scratch:
# save init model, i.e. 0 epoch
self.save(tag='init')
self.lr_scheduler.step(self.iteration)
if self.parallel:
self.train_loader.batch_sampler.set_epoch(self.epoch)
logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}")
while self.epoch < self.config.training.n_epoch:
self.model.train()
try:
data_start_time = time.time()
for batch_index, batch in enumerate(self.train_loader):
dataload_time = time.time() - data_start_time
msg = "Train: Rank: {}, ".format(dist.get_rank())
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "batch : {}/{}, ".format(batch_index + 1,
len(self.train_loader))
msg += "lr: {:>.8f}, ".format(self.lr_scheduler())
msg += "data time: {:>.3f}s, ".format(dataload_time)
self.train_batch(batch_index, batch, msg)
data_start_time = time.time()
except Exception as e:
logger.error(e)
raise e
total_loss, num_seen_utts = self.valid()
if dist.get_world_size() > 1:
num_seen_utts = paddle.to_tensor(num_seen_utts)
# the default reduce operator in all_reduce is sum.
dist.all_reduce(num_seen_utts)
total_loss = paddle.to_tensor(total_loss)
dist.all_reduce(total_loss)
cv_loss = total_loss / num_seen_utts
cv_loss = float(cv_loss)
else:
cv_loss = total_loss / num_seen_utts
logger.info(
'Epoch {} Val info val_loss {}'.format(self.epoch, cv_loss))
if self.visualizer:
self.visualizer.add_scalars(
'epoch', {'cv_loss': cv_loss,
'lr': self.lr_scheduler()}, self.epoch)
self.save(tag=self.epoch, infos={'val_loss': cv_loss})
self.new_epoch()
def setup_dataloader(self):
config = self.config.clone()
config.defrost()
config.data.keep_transcription_text = False
# train/valid dataset, return token ids
config.data.manifest = config.data.train_manifest
train_dataset = ManifestDataset.from_config(config)
config.data.manifest = config.data.dev_manifest
config.data.augmentation_config = ""
dev_dataset = ManifestDataset.from_config(config)
collate_fn = SpeechCollator(keep_transcription_text=False)
if self.parallel:
batch_sampler = SortagradDistributedBatchSampler(
train_dataset,
batch_size=config.data.batch_size,
num_replicas=None,
rank=None,
shuffle=True,
drop_last=True,
sortagrad=config.data.sortagrad,
shuffle_method=config.data.shuffle_method)
else:
batch_sampler = SortagradBatchSampler(
train_dataset,
shuffle=True,
batch_size=config.data.batch_size,
drop_last=True,
sortagrad=config.data.sortagrad,
shuffle_method=config.data.shuffle_method)
self.train_loader = DataLoader(
train_dataset,
batch_sampler=batch_sampler,
collate_fn=collate_fn,
num_workers=config.data.num_workers, )
self.valid_loader = DataLoader(
dev_dataset,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
# test dataset, return raw text
config.data.manifest = config.data.test_manifest
config.data.keep_transcription_text = True
config.data.augmentation_config = ""
# filter test examples; this yields fewer examples but avoids a mismatch with training
# and allows a large batch size to save time, so filter the test egs for now.
# config.data.min_input_len = 0.0 # second
# config.data.max_input_len = float('inf') # second
# config.data.min_output_len = 0.0 # tokens
# config.data.max_output_len = float('inf') # tokens
# config.data.min_output_input_ratio = 0.00
# config.data.max_output_input_ratio = float('inf')
test_dataset = ManifestDataset.from_config(config)
# return text ord id
self.test_loader = DataLoader(
test_dataset,
batch_size=config.decoding.batch_size,
shuffle=False,
drop_last=False,
collate_fn=SpeechCollator(keep_transcription_text=True))
logger.info("Setup train/valid/test Dataloader!")
def setup_model(self):
config = self.config
model_conf = config.model
model_conf.defrost()
model_conf.input_dim = self.train_loader.dataset.feature_size
model_conf.output_dim = self.train_loader.dataset.vocab_size
model_conf.freeze()
model = U2Model.from_config(model_conf)
if self.parallel:
model = paddle.DataParallel(model)
logger.info(f"{model}")
layer_tools.print_params(model, logger.info)
train_config = config.training
optim_type = train_config.optim
optim_conf = train_config.optim_conf
scheduler_type = train_config.scheduler
scheduler_conf = train_config.scheduler_conf
grad_clip = ClipGradByGlobalNormWithLog(train_config.global_grad_clip)
weight_decay = paddle.regularizer.L2Decay(optim_conf.weight_decay)
if scheduler_type == 'expdecaylr':
lr_scheduler = paddle.optimizer.lr.ExponentialDecay(
learning_rate=optim_conf.lr,
gamma=scheduler_conf.lr_decay,
verbose=False)
elif scheduler_type == 'warmuplr':
lr_scheduler = WarmupLR(
learning_rate=optim_conf.lr,
warmup_steps=scheduler_conf.warmup_steps,
verbose=False)
else:
raise ValueError(f"Not support scheduler: {scheduler_type}")
if optim_type == 'adam':
optimizer = paddle.optimizer.Adam(
learning_rate=lr_scheduler,
parameters=model.parameters(),
weight_decay=weight_decay,
grad_clip=grad_clip)
else:
raise ValueError(f"Not support optim: {optim_type}")
self.model = model
self.optimizer = optimizer
self.lr_scheduler = lr_scheduler
logger.info("Setup model/optimizer/lr_scheduler!")
class U2Tester(U2Trainer):
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
# decoding config
default = CfgNode(
dict(
alpha=2.5, # Coef of LM for beam search.
beta=0.3, # Coef of WC for beam search.
cutoff_prob=1.0, # Cutoff probability for pruning.
cutoff_top_n=40, # Cutoff number for pruning.
lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm', # Filepath for language model.
decoding_method='attention', # Decoding method. Options: 'attention', 'ctc_greedy_search',
# 'ctc_prefix_beam_search', 'attention_rescoring'
error_rate_type='wer', # Error rate type for evaluation. Options `wer`, 'cer'
num_proc_bsearch=8, # # of CPUs for beam search.
beam_size=10, # Beam search width.
batch_size=16, # decoding batch size
ctc_weight=0.0, # ctc weight for attention rescoring decode mode.
decoding_chunk_size=-1, # decoding chunk size. Defaults to -1.
# <0: for decoding, use full chunk.
# >0: for decoding, use fixed chunk size as set.
# 0: used for training, it's prohibited here.
num_decoding_left_chunks=-1, # number of left chunks for decoding. Defaults to -1.
simulate_streaming=False, # simulate streaming inference. Defaults to False.
))
if config is not None:
config.merge_from_other_cfg(default)
return default
def __init__(self, config, args):
super().__init__(config, args)
def ordid2token(self, texts, texts_len):
""" ord() id to chr() chr """
trans = []
for text, n in zip(texts, texts_len):
n = n.numpy().item()
ids = text[:n]
trans.append(''.join([chr(i) for i in ids]))
return trans
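# The token ids here are raw Unicode code points, so decoding is a
# pure-Python round trip (hypothetical ids):
ids = [ord(c) for c in "你好"]          # e.g. [20320, 22909]
text = ''.join(chr(i) for i in ids)
assert text == "你好"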
def compute_metrics(self, audio, audio_len, texts, texts_len, fout=None):
cfg = self.config.decoding
errors_sum, len_refs, num_ins = 0.0, 0, 0
errors_func = error_rate.char_errors if cfg.error_rate_type == 'cer' else error_rate.word_errors
error_rate_func = error_rate.cer if cfg.error_rate_type == 'cer' else error_rate.wer
start_time = time.time()
text_feature = self.test_loader.dataset.text_feature
target_transcripts = self.ordid2token(texts, texts_len)
result_transcripts = self.model.decode(
audio,
audio_len,
text_feature=text_feature,
decoding_method=cfg.decoding_method,
lang_model_path=cfg.lang_model_path,
beam_alpha=cfg.alpha,
beam_beta=cfg.beta,
beam_size=cfg.beam_size,
cutoff_prob=cfg.cutoff_prob,
cutoff_top_n=cfg.cutoff_top_n,
num_processes=cfg.num_proc_bsearch,
ctc_weight=cfg.ctc_weight,
decoding_chunk_size=cfg.decoding_chunk_size,
num_decoding_left_chunks=cfg.num_decoding_left_chunks,
simulate_streaming=cfg.simulate_streaming)
decode_time = time.time() - start_time
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
errors_sum += errors
len_refs += len_ref
num_ins += 1
if fout:
fout.write(result + "\n")
logger.info("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
logger.info("One example error rate [%s] = %f" %
(cfg.error_rate_type, error_rate_func(target, result)))
return dict(
errors_sum=errors_sum,
len_refs=len_refs,
num_ins=num_ins, # num examples
error_rate=errors_sum / len_refs,
error_rate_type=cfg.error_rate_type,
num_frames=audio_len.sum().numpy().item(),
decode_time=decode_time)
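# The metric above is pooled: sum of errors over sum of reference lengths,
# which is not the mean of per-utterance rates. A small sketch using the same
# errors_func convention as above, with hypothetical transcripts:
from deepspeech.utils.error_rate import word_errors
pairs = [("the cat sat", "the cat sat"),   # 0 errors, 3 ref words
         ("hello world", "hello word")]    # 1 error,  2 ref words
errors_sum, len_refs = 0.0, 0
for target, result in pairs:
    errors, len_ref = word_errors(target, result)
    errors_sum += errors
    len_refs += len_ref
print(errors_sum / len_refs)  # 0.2 pooled WER, vs mean(0.0, 0.5) = 0.25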
@mp_tools.rank_zero_only
@paddle.no_grad()
def test(self):
assert self.args.result_file
self.model.eval()
logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}")
stride_ms = self.test_loader.dataset.stride_ms
error_rate_type = None
errors_sum, len_refs, num_ins = 0.0, 0, 0
num_frames = 0.0
num_time = 0.0
with open(self.args.result_file, 'w') as fout:
for i, batch in enumerate(self.test_loader):
metrics = self.compute_metrics(*batch, fout=fout)
num_frames += metrics['num_frames']
num_time += metrics["decode_time"]
errors_sum += metrics['errors_sum']
len_refs += metrics['len_refs']
num_ins += metrics['num_ins']
error_rate_type = metrics['error_rate_type']
rtf = num_time / (num_frames * stride_ms)
logger.info(
"RTF: %f, Error rate [%s] (%d/?) = %f" %
(rtf, error_rate_type, num_ins, errors_sum / len_refs))
rtf = num_time / (num_frames * stride_ms)
msg = "Test: "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "RTF: {}, ".format(rtf)
msg += "Final error rate [%s] (%d/%d) = %f" % (
error_rate_type, num_ins, num_ins, errors_sum / len_refs)
logger.info(msg)
# test meta results
err_meta_path = os.path.splitext(self.args.checkpoint_path)[0] + '.err'
err_type_str = "{}".format(error_rate_type)
with open(err_meta_path, 'w') as f:
data = json.dumps({
"epoch":
self.epoch,
"step":
self.iteration,
"rtf":
rtf,
error_rate_type:
errors_sum / len_refs,
"dataset_hour": (num_frames * stride_ms) / 1000.0 / 3600.0,
"process_hour":
num_time / 1000.0 / 3600.0,
"num_examples":
num_ins,
"err_sum":
errors_sum,
"ref_len":
len_refs,
})
f.write(data + '\n')
def run_test(self):
self.resume_or_scratch()
try:
self.test()
except KeyboardInterrupt:
sys.exit(-1)
def load_inferspec(self):
"""infer model and input spec.
Returns:
nn.Layer: inference model
List[paddle.static.InputSpec]: input spec.
"""
from deepspeech.models.u2 import U2InferModel
infer_model = U2InferModel.from_pretrained(self.test_loader.dataset,
self.config.model.clone(),
self.args.checkpoint_path)
feat_dim = self.test_loader.dataset.feature_size
input_spec = [
paddle.static.InputSpec(
shape=[None, feat_dim, None],
dtype='float32'), # audio, [B,D,T]
paddle.static.InputSpec(shape=[None],
dtype='int64'), # audio_length, [B]
]
return infer_model, input_spec
def export(self):
infer_model, input_spec = self.load_inferspec()
assert isinstance(input_spec, list), type(input_spec)
infer_model.eval()
static_model = paddle.jit.to_static(infer_model, input_spec=input_spec)
logger.info(f"Export code: {static_model.forward.code}")
paddle.jit.save(static_model, self.args.export_path)
def run_export(self):
try:
self.export()
except KeyboardInterrupt:
sys.exit(-1)
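# The same to_static/save flow on a toy layer (a minimal sketch, independent
# of the U2 model):
import paddle
net = paddle.nn.Linear(16, 4)
net.eval()
spec = [paddle.static.InputSpec(shape=[None, 16], dtype='float32')]
static_net = paddle.jit.to_static(net, input_spec=spec)
paddle.jit.save(static_net, '/tmp/linear_infer')  # writes *.pdmodel / *.pdiparams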
def setup(self):
"""Setup the experiment.
"""
paddle.set_device(self.args.device)
self.setup_output_dir()
self.setup_checkpointer()
self.setup_dataloader()
self.setup_model()
self.iteration = 0
self.epoch = 0
def setup_output_dir(self):
"""Create a directory used for output.
"""
# output dir
if self.args.output:
output_dir = Path(self.args.output).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
else:
output_dir = Path(
self.args.checkpoint_path).expanduser().parent.parent
output_dir.mkdir(parents=True, exist_ok=True)
self.output_dir = output_dir

@ -12,17 +12,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the audio segment class."""
import numpy as np
import copy
import io
import struct
import random
import re
import soundfile
import struct
import numpy as np
import resampy
import soundfile
from scipy import signal
import random
import copy
import io
class AudioSegment(object):
@ -299,6 +298,18 @@ class AudioSegment(object):
samples = self._convert_samples_from_float32(self._samples, dtype)
return samples.tostring()
def to(self, dtype='int16'):
"""Create a `dtype` audio content.
:param dtype: Data type for export samples. Options: 'int16', 'int32',
'float32', 'float64'. Default is 'float32'.
:type dtype: str
:return: np.ndarray containing `dtype` audio content.
:rtype: str
"""
samples = self._convert_samples_from_float32(self._samples, dtype)
return samples
def gain_db(self, gain):
"""Apply gain in decibels to samples.
@ -322,14 +333,25 @@ class AudioSegment(object):
:type speed_rate: float
:raises ValueError: If speed_rate <= 0.0.
"""
if speed_rate == 1.0:
return
if speed_rate <= 0:
raise ValueError("speed_rate should be greater than zero.")
# numpy
old_length = self._samples.shape[0]
new_length = int(old_length / speed_rate)
old_indices = np.arange(old_length)
new_indices = np.linspace(start=0, stop=old_length, num=new_length)
self._samples = np.interp(new_indices, old_indices, self._samples)
# sox, slow
# tfm = sox.Transformer()
# tfm.set_globals(multithread=False)
# tfm.speed(speed_rate)
# self._samples = tfm.build_array(
# input_array=self._samples, sample_rate_in=self._sample_rate).copy()
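# A numpy-only sketch of the interpolation above: a rate above 1.0 shortens
# the signal, a rate below 1.0 stretches it.
import numpy as np
samples = np.arange(10, dtype='float32')
speed_rate = 1.25
old_len = samples.shape[0]
new_len = int(old_len / speed_rate)  # 10 samples -> 8 samples
new_idx = np.linspace(start=0, stop=old_len, num=new_len)
faster = np.interp(new_idx, np.arange(old_len), samples)
assert faster.shape[0] == new_len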
def normalize(self, target_db=-20, max_gain_db=300.0):
"""Normalize audio to be of the desired RMS value in decibels.

@ -12,17 +12,19 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the data augmentation pipeline."""
import json
import random
from deepspeech.frontend.augmentor.volume_perturb import VolumePerturbAugmentor
from deepspeech.frontend.augmentor.shift_perturb import ShiftPerturbAugmentor
from deepspeech.frontend.augmentor.speed_perturb import SpeedPerturbAugmentor
from deepspeech.frontend.augmentor.noise_perturb import NoisePerturbAugmentor
import numpy as np
from deepspeech.frontend.augmentor.impulse_response import ImpulseResponseAugmentor
from deepspeech.frontend.augmentor.resample import ResampleAugmentor
from deepspeech.frontend.augmentor.noise_perturb import NoisePerturbAugmentor
from deepspeech.frontend.augmentor.online_bayesian_normalization import \
OnlineBayesianNormalizationAugmentor
OnlineBayesianNormalizationAugmentor
from deepspeech.frontend.augmentor.resample import ResampleAugmentor
from deepspeech.frontend.augmentor.shift_perturb import ShiftPerturbAugmentor
from deepspeech.frontend.augmentor.spec_augment import SpecAugmentor
from deepspeech.frontend.augmentor.speed_perturb import SpeedPerturbAugmentor
from deepspeech.frontend.augmentor.volume_perturb import VolumePerturbAugmentor
class AugmentationPipeline():
@ -83,10 +85,13 @@ class AugmentationPipeline():
:raises ValueError: If the augmentation json config is in incorrect format".
"""
def __init__(self, augmentation_config, random_seed=0):
self._rng = random.Random(random_seed)
def __init__(self, augmentation_config: str, random_seed=0):
self._rng = np.random.RandomState(random_seed)
self._spec_types = ('specaug')
self._augmentors, self._rates = self._parse_pipeline_from(
augmentation_config)
augmentation_config, 'audio')
self._spec_augmentors, self._spec_rates = self._parse_pipeline_from(
augmentation_config, 'feature')
def transform_audio(self, audio_segment):
"""Run the pre-processing pipeline for data augmentation.
@ -100,15 +105,41 @@ class AugmentationPipeline():
if self._rng.uniform(0., 1.) < rate:
augmentor.transform_audio(audio_segment)
def _parse_pipeline_from(self, config_json):
def transform_feature(self, spec_segment):
"""spectrogram augmentation.
Args:
spec_segment (np.ndarray): audio feature, (D, T).
"""
for augmentor, rate in zip(self._spec_augmentors, self._spec_rates):
if self._rng.uniform(0., 1.) < rate:
spec_segment = augmentor.transform_feature(spec_segment)
return spec_segment
def _parse_pipeline_from(self, config_json, aug_type='audio'):
"""Parse the config json to build a augmentation pipelien."""
assert aug_type in ('audio', 'feature'), aug_type
try:
configs = json.loads(config_json)
audio_confs = []
feature_confs = []
for config in configs:
if config["type"] in self._spec_types:
feature_confs.append(config)
else:
audio_confs.append(config)
if aug_type == 'audio':
aug_confs = audio_confs
elif aug_type == 'feature':
aug_confs = feature_confs
augmentors = [
self._get_augmentor(config["type"], config["params"])
for config in configs
for config in aug_confs
]
rates = [config["prob"] for config in configs]
rates = [config["prob"] for config in aug_confs]
except Exception as e:
raise ValueError("Failed to parse the augmentation config json: "
"%s" % str(e))
@ -130,5 +161,7 @@ class AugmentationPipeline():
return NoisePerturbAugmentor(self._rng, **params)
elif augmentor_type == "impulse":
return ImpulseResponseAugmentor(self._rng, **params)
elif augmentor_type == "specaug":
return SpecAugmentor(self._rng, **params)
else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type)

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the abstract base class for augmentation models."""
from abc import ABCMeta, abstractmethod
from abc import ABCMeta
from abc import abstractmethod
class AugmentorBase():
@ -40,4 +40,16 @@ class AugmentorBase():
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
pass
raise NotImplementedError
@abstractmethod
def transform_feature(self, spec_segment):
"""Adds various effects to the input audo feature segment. Such effects
will augment the training data to make the model invariant to certain
types of time_mask or freq_mask in the real world, improving model's
generalization ability.
Args:
spec_segment (Spectrogram): Spectrogram segment to add effects to.
"""
raise NotImplementedError

@ -12,10 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the impulse response augmentation model."""
from deepspeech.frontend.audio import AudioSegment
from deepspeech.frontend.augmentor.base import AugmentorBase
from deepspeech.frontend.utility import read_manifest
from deepspeech.frontend.audio import AudioSegment
class ImpulseResponseAugmentor(AugmentorBase):
@ -39,6 +38,7 @@ class ImpulseResponseAugmentor(AugmentorBase):
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
impulse_json = self._rng.sample(self._impulse_manifest, 1)[0]
impulse_json = self._rng.choice(
self._impulse_manifest, 1, replace=False)[0]
impulse_segment = AudioSegment.from_file(impulse_json['audio_filepath'])
audio_segment.convolve(impulse_segment, allow_resample=True)

@ -12,10 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the noise perturb augmentation model."""
from deepspeech.frontend.audio import AudioSegment
from deepspeech.frontend.augmentor.base import AugmentorBase
from deepspeech.frontend.utility import read_manifest
from deepspeech.frontend.audio import AudioSegment
class NoisePerturbAugmentor(AugmentorBase):
@ -45,7 +44,7 @@ class NoisePerturbAugmentor(AugmentorBase):
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
noise_json = self._rng.sample(self._noise_manifest, 1)[0]
noise_json = self._rng.choice(self._noise_manifest, 1, replace=False)[0]
if noise_json['duration'] < audio_segment.duration:
raise RuntimeError("The duration of sampled noise audio is smaller "
"than the audio segment to add effects to.")

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contain the online bayesian normalization augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contain the resample augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the volume perturb augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase

@ -0,0 +1,170 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the volume perturb augmentation model."""
import numpy as np
from deepspeech.frontend.augmentor.base import AugmentorBase
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
class SpecAugmentor(AugmentorBase):
"""Augmentation model for Time warping, Frequency masking, Time masking.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
https://arxiv.org/abs/1904.08779
SpecAugment on Large Scale Datasets
https://arxiv.org/abs/1912.05533
"""
def __init__(self,
rng,
F,
T,
n_freq_masks,
n_time_masks,
p=1.0,
W=40,
adaptive_number_ratio=0,
adaptive_size_ratio=0,
max_n_time_masks=20):
"""SpecAugment class.
Args:
rng (random.Random): random generator object.
F (int): parameter for frequency masking
T (int): parameter for time masking
n_freq_masks (int): number of frequency masks
n_time_masks (int): number of time masks
p (float): parameter for the upper bound of the time mask
W (int): parameter for time warping
adaptive_number_ratio (float): adaptive multiplicity ratio for time masking
adaptive_size_ratio (float): adaptive size ratio for time masking
max_n_time_masks (int): maximum number of time masks
"""
super().__init__()
self._rng = rng
self.W = W
self.F = F
self.T = T
self.n_freq_masks = n_freq_masks
self.n_time_masks = n_time_masks
self.p = p
#logger.info(f"specaug: F-{F}, T-{T}, F-n-{n_freq_masks}, T-n-{n_time_masks}")
# adaptive SpecAugment
self.adaptive_number_ratio = adaptive_number_ratio
self.adaptive_size_ratio = adaptive_size_ratio
self.max_n_time_masks = max_n_time_masks
if adaptive_number_ratio > 0:
self.n_time_masks = 0
logger.info('n_time_masks is set to zero for adaptive SpecAugment.')
if adaptive_size_ratio > 0:
self.T = 0
logger.info('T is set to zero for adaptive SpecAugment.')
self._freq_mask = None
self._time_mask = None
def librispeech_basic(self):
self.W = 80
self.F = 27
self.T = 100
self.n_freq_masks = 1
self.n_time_masks = 1
self.p = 1.0
def librispeech_double(self):
self.W = 80
self.F = 27
self.T = 100
self.n_freq_masks = 2
self.n_time_masks = 2
self.p = 1.0
def switchboard_mild(self):
self.W = 40
self.F = 15
self.T = 70
self.n_freq_masks = 2
self.n_time_masks = 2
self.p = 0.2
def switchboard_strong(self):
self.W = 40
self.F = 27
self.T = 70
self.n_freq_masks = 2
self.n_time_masks = 2
self.p = 0.2
@property
def freq_mask(self):
return self._freq_mask
@property
def time_mask(self):
return self._time_mask
def time_warp(self, xs, W=40):
raise NotImplementedError
def mask_freq(self, xs, replace_with_zero=False):
n_bins = xs.shape[0]
for i in range(0, self.n_freq_masks):
f = int(self._rng.uniform(low=0, high=self.F))
f_0 = int(self._rng.uniform(low=0, high=n_bins - f))
xs[f_0:f_0 + f, :] = 0
assert f_0 <= f_0 + f
self._freq_mask = (f_0, f_0 + f)
return xs
def mask_time(self, xs, replace_with_zero=False):
n_frames = xs.shape[1]
if self.adaptive_number_ratio > 0:
n_masks = int(n_frames * self.adaptive_number_ratio)
n_masks = min(n_masks, self.max_n_time_masks)
else:
n_masks = self.n_time_masks
if self.adaptive_size_ratio > 0:
T = self.adaptive_size_ratio * n_frames
else:
T = self.T
for i in range(n_masks):
t = int(self._rng.uniform(low=0, high=T))
t = min(t, int(n_frames * self.p))
t_0 = int(self._rng.uniform(low=0, high=n_frames - t))
xs[:, t_0:t_0 + t] = 0
assert t_0 <= t_0 + t
self._time_mask = (t_0, t_0 + t)
return xs
def transform_feature(self, xs: np.ndarray):
"""
Args:
xs (FloatTensor): `[F, T]`
Returns:
xs (FloatTensor): `[F, T]`
"""
# xs = self.time_warp(xs)
xs = self.mask_freq(xs)
xs = self.mask_time(xs)
return xs
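# Minimal usage sketch, assuming the import path used by the augmentation
# pipeline; transform_feature masks the (D, T) spectrogram in place and
# returns it.
import numpy as np
from deepspeech.frontend.augmentor.spec_augment import SpecAugmentor
rng = np.random.RandomState(0)
aug = SpecAugmentor(rng, F=27, T=100, n_freq_masks=2, n_time_masks=2, p=1.0)
spec = rng.randn(80, 300).astype('float32')  # e.g. 80-dim fbank, 300 frames
masked = aug.transform_feature(spec)         # bands/frames zeroed out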

@ -12,36 +12,72 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contain the speech perturbation augmentation model."""
import numpy as np
from deepspeech.frontend.augmentor.base import AugmentorBase
class SpeedPerturbAugmentor(AugmentorBase):
"""Augmentation model for adding speed perturbation.
See reference paper here:
http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf
:param rng: Random generator object.
:type rng: random.Random
:param min_speed_rate: Lower bound of new speed rate to sample and should
not be smaller than 0.9.
:type min_speed_rate: float
:param max_speed_rate: Upper bound of new speed rate to sample and should
not be larger than 1.1.
:type max_speed_rate: float
"""
def __init__(self, rng, min_speed_rate, max_speed_rate):
"""Augmentation model for adding speed perturbation."""
def __init__(self, rng, min_speed_rate=0.9, max_speed_rate=1.1,
num_rates=3):
"""speed perturbation.
The speed perturbation in kaldi uses sox-speed instead of sox-tempo,
and sox-speed just resamples the input,
i.e. both pitch and tempo are changed.
"Why use speed option instead of tempo -s in SoX for speed perturbation"
https://groups.google.com/forum/#!topic/kaldi-help/8OOG7eE4sZ8
Sox speed:
https://pysox.readthedocs.io/en/latest/api.html#sox.transform.Transformer
See reference paper here:
http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf
Espnet:
https://espnet.github.io/espnet/_modules/espnet/transform/perturb.html
Nemo:
https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/perturb.py#L92
Args:
rng (random.Random): Random generator object.
min_speed_rate (float): Lower bound of new speed rate to sample and should
not be smaller than 0.9.
max_speed_rate (float): Upper bound of new speed rate to sample and should
not be larger than 1.1.
num_rates (int, optional): Number of discrete rates to allow.
Can be a positive or negative integer. Defaults to 3.
If a positive integer greater than 0 is provided, the range of
speed rates will be discretized into `num_rates` values.
If a negative integer or 0 is provided, the full range of speed rates
will be sampled uniformly.
Note: If a positive integer is provided and the resulting discretized
range of rates contains the value '1.0', then samples with rate=1.0
are not augmented at all and simply skipped. This is to avoid
unnecessary augmentation and extra computation time. The effective
augmentation chance in such a case is `prob * ((num_rates - 1) / num_rates) * 100`%,
where `prob` is the global probability of a sample being augmented.
Raises:
ValueError: when speed_rate error
"""
if min_speed_rate < 0.9:
raise ValueError(
"Sampling speed below 0.9 can cause unnatural effects")
if max_speed_rate > 1.1:
raise ValueError(
"Sampling speed above 1.1 can cause unnatural effects")
self._min_speed_rate = min_speed_rate
self._max_speed_rate = max_speed_rate
self._min_rate = min_speed_rate
self._max_rate = max_speed_rate
self._rng = rng
self._num_rates = num_rates
if num_rates > 0:
self._rates = np.linspace(
self._min_rate, self._max_rate, self._num_rates, endpoint=True)
def transform_audio(self, audio_segment):
"""Sample a new speed rate from the given range and
@ -52,6 +88,13 @@ class SpeedPerturbAugmentor(AugmentorBase):
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegment|SpeechSegment
"""
sampled_speed = self._rng.uniform(self._min_speed_rate,
self._max_speed_rate)
audio_segment.change_speed(sampled_speed)
if self._num_rates < 0:
speed_rate = self._rng.uniform(self._min_rate, self._max_rate)
else:
speed_rate = self._rng.choice(self._rates)
# Skip perturbation in case of identity speed rate
if speed_rate == 1.0:
return
audio_segment.change_speed(speed_rate)
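# Sketch of the discretized sampling above: with the defaults the only
# allowed rates are 0.9, 1.0 and 1.1, and 1.0 is skipped as a no-op.
import numpy as np
rng = np.random.RandomState(0)
rates = np.linspace(0.9, 1.1, 3, endpoint=True)  # array([0.9, 1.0, 1.1])
speed_rate = rng.choice(rates)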

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the volume perturb augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase

@ -12,12 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the audio featurizer class."""
import numpy as np
from deepspeech.frontend.utility import read_manifest
from deepspeech.frontend.audio import AudioSegment
from python_speech_features import mfcc
from python_speech_features import delta
from python_speech_features import logfbank
from python_speech_features import mfcc
class AudioFeaturizer(object):
@ -49,15 +47,22 @@ class AudioFeaturizer(object):
"""
def __init__(self,
specgram_type='linear',
specgram_type: str='linear',
feat_dim: int=None,
delta_delta: bool=False,
stride_ms=10.0,
window_ms=20.0,
n_fft=None,
max_freq=None,
target_sample_rate=16000,
use_dB_normalization=True,
target_dB=-20):
target_dB=-20,
dither=1.0):
self._specgram_type = specgram_type
# mfcc and fbank using `feat_dim`
self._feat_dim = feat_dim
# mfcc and fbank using `delta-delta`
self._delta_delta = delta_delta
self._stride_ms = stride_ms
self._window_ms = window_ms
self._max_freq = max_freq
@ -65,6 +70,7 @@ class AudioFeaturizer(object):
self._use_dB_normalization = use_dB_normalization
self._target_dB = target_dB
self._fft_point = n_fft
self._dither = dither
def featurize(self,
audio_segment,
@ -97,8 +103,11 @@ class AudioFeaturizer(object):
if self._use_dB_normalization:
audio_segment.normalize(target_db=self._target_dB)
# extract spectrogram
return self._compute_specgram(audio_segment.samples,
audio_segment.sample_rate)
return self._compute_specgram(audio_segment)
@property
def stride_ms(self):
return self._stride_ms
@property
def feature_size(self):
@ -109,22 +118,51 @@ class AudioFeaturizer(object):
feat_dim = int(fft_point * (self._target_sample_rate / 1000) / 2 +
1)
elif self._specgram_type == 'mfcc':
# mfcc,delta, delta-delta
feat_dim = int(13 * 3)
# mfcc, delta, delta-delta
feat_dim = int(self._feat_dim *
3) if self._delta_delta else int(self._feat_dim)
elif self._specgram_type == 'fbank':
# fbank, delta, delta-delta
feat_dim = int(self._feat_dim *
3) if self._delta_delta else int(self._feat_dim)
else:
raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type)
return feat_dim
def _compute_specgram(self, samples, sample_rate):
def _compute_specgram(self, audio_segment):
"""Extract various audio features."""
sample_rate = audio_segment.sample_rate
if self._specgram_type == 'linear':
samples = audio_segment.samples
return self._compute_linear_specgram(
samples, sample_rate, self._stride_ms, self._window_ms,
self._max_freq)
samples,
sample_rate,
stride_ms=self._stride_ms,
window_ms=self._window_ms,
max_freq=self._max_freq)
elif self._specgram_type == 'mfcc':
return self._compute_mfcc(samples, sample_rate, self._stride_ms,
self._window_ms, self._max_freq)
samples = audio_segment.to('int16')
return self._compute_mfcc(
samples,
sample_rate,
feat_dim=self._feat_dim,
stride_ms=self._stride_ms,
window_ms=self._window_ms,
max_freq=self._max_freq,
dither=self._dither,
delta_delta=self._delta_delta)
elif self._specgram_type == 'fbank':
samples = audio_segment.to('int16')
return self._compute_fbank(
samples,
sample_rate,
feat_dim=self._feat_dim,
stride_ms=self._stride_ms,
window_ms=self._window_ms,
max_freq=self._max_freq,
dither=self._dither,
delta_delta=self._delta_delta)
else:
raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type)
@ -179,13 +217,55 @@ class AudioFeaturizer(object):
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
return fft, freqs
def _concat_delta_delta(self, feat):
"""append delat, delta-delta feature.
Args:
feat (np.ndarray): (D, T)
Returns:
np.ndarray: feat with delta-delta, (3*D, T)
"""
feat = np.transpose(feat)
# Deltas
d_feat = delta(feat, 2)
# Deltas-Deltas
dd_feat = delta(d_feat, 2)
# transpose
feat = np.transpose(feat)
d_feat = np.transpose(d_feat)
dd_feat = np.transpose(dd_feat)
# concat above three features
concat_feat = np.concatenate((feat, d_feat, dd_feat))
return concat_feat
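# Shape check for the helper above (python_speech_features' delta(feat, N)
# expects (T, D)): a (D, T) feature becomes (3*D, T).
import numpy as np
from python_speech_features import delta
feat = np.random.randn(13, 50)               # (D, T) mfcc-like input
t_feat = np.transpose(feat)                  # (T, D) for delta()
d = delta(t_feat, 2)
dd = delta(d, 2)
out = np.concatenate((t_feat.T, d.T, dd.T))  # back to (3*D, T)
assert out.shape == (39, 50)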
def _compute_mfcc(self,
samples,
sample_rate,
feat_dim=13,
stride_ms=10.0,
window_ms=20.0,
max_freq=None):
"""Compute mfcc from samples."""
window_ms=25.0,
max_freq=None,
dither=1.0,
delta_delta=True):
"""Compute mfcc from samples.
Args:
samples (np.ndarray, np.int16): the audio signal from which to compute features.
sample_rate (float): the sample rate of the signal we are working with, in Hz.
feat_dim (int): the number of cepstral coefficients to return, default 13.
stride_ms (float, optional): stride length in ms. Defaults to 10.0.
window_ms (float, optional): window length in ms. Defaults to 25.0.
max_freq (float, optional): highest band edge of mel filters. In Hz, default is samplerate/2. Defaults to None.
delta_delta (bool, optional): Whether to append delta and delta-delta features. Defaults to True.
Raises:
ValueError: max_freq > samplerate/2
ValueError: stride_ms > window_ms
Returns:
np.ndarray: mfcc feature, (D, T).
"""
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
@ -195,22 +275,79 @@ class AudioFeaturizer(object):
raise ValueError("Stride size must not be greater than "
"window size.")
# compute the 13 cepstral coefficients, and the first one is replaced
# by log(frame energy), (T, D)
mfcc_feat = mfcc(
    signal=samples,
    samplerate=sample_rate,
    winlen=0.001 * window_ms,
    winstep=0.001 * stride_ms,
    numcep=feat_dim,
    nfilt=23,
    nfft=512,
    lowfreq=20,
    highfreq=max_freq,
    dither=dither,
    remove_dc_offset=True,
    preemph=0.97,
    ceplifter=22,
    useEnergy=True,
    winfunc='povey')
mfcc_feat = np.transpose(mfcc_feat)
if delta_delta:
    mfcc_feat = self._concat_delta_delta(mfcc_feat)
return mfcc_feat
def _compute_fbank(self,
samples,
sample_rate,
feat_dim=40,
stride_ms=10.0,
window_ms=25.0,
max_freq=None,
dither=1.0,
delta_delta=False):
"""Compute logfbank from samples.
Args:
samples (np.ndarray, np.int16): the audio signal from which to compute features. Should be an N*1 array
sample_rate (float): the sample rate of the signal we are working with, in Hz.
feat_dim (int): the number of cepstrum to return, default 13.
stride_ms (float, optional): stride length in ms. Defaults to 10.0.
window_ms (float, optional): window length in ms. Defaults to 20.0.
max_freq (float, optional): highest band edge of mel filters. In Hz, default is samplerate/2. Defaults to None.
delta_delta (bool, optional): Whether with delta delta. Defaults to False.
Raises:
ValueError: max_freq > samplerate/2
ValueError: stride_ms > window_ms
Returns:
np.ndarray: mfcc feature, (D, T).
"""
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq must not be greater than half of "
"sample rate.")
if stride_ms > window_ms:
raise ValueError("Stride size must not be greater than "
"window size.")
# (T, D)
fbank_feat = logfbank(
signal=samples,
samplerate=sample_rate,
winlen=0.001 * window_ms,
winstep=0.001 * stride_ms,
nfilt=feat_dim,
nfft=512,
lowfreq=20,
highfreq=max_freq,
dither=dither,
remove_dc_offset=True,
preemph=0.97,
wintype='povey')
fbank_feat = np.transpose(fbank_feat)
if delta_delta:
fbank_feat = self._concat_delta_delta(fbank_feat)
return fbank_feat
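A standalone sketch (not part of the diff) of the delta/delta-delta stacking that _concat_delta_delta performs; simple_delta here is a hypothetical stand-in for python_speech_features' delta, shown only to verify that a (D, T) feature becomes (3*D, T):

    import numpy as np

    def simple_delta(feat, N=2):
        # feat: (T, D); regression-style delta over +/-N neighboring frames
        T = feat.shape[0]
        padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
        denom = 2.0 * sum(n * n for n in range(1, N + 1))
        return np.stack([
            sum(n * (padded[t + N + n] - padded[t + N - n])
                for n in range(1, N + 1)) / denom for t in range(T)
        ])

    feat = np.random.randn(13, 100)  # (D, T): 13 MFCCs, 100 frames
    f = np.transpose(feat)           # (T, D)
    d = simple_delta(f)
    dd = simple_delta(d)
    stacked = np.concatenate((feat, d.T, dd.T))  # (3*D, T)
    assert stacked.shape == (39, 100)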

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the speech featurizer class."""
from deepspeech.frontend.featurizer.audio_featurizer import AudioFeaturizer
from deepspeech.frontend.featurizer.text_featurizer import TextFeaturizer
@ -52,25 +51,34 @@ class SpeechFeaturizer(object):
"""
def __init__(self,
unit_type,
vocab_filepath,
spm_model_prefix=None,
specgram_type='linear',
feat_dim=None,
delta_delta=False,
stride_ms=10.0,
window_ms=20.0,
n_fft=None,
max_freq=None,
target_sample_rate=16000,
use_dB_normalization=True,
target_dB=-20,
dither=1.0):
self._audio_featurizer = AudioFeaturizer(
specgram_type=specgram_type,
feat_dim=feat_dim,
delta_delta=delta_delta,
stride_ms=stride_ms,
window_ms=window_ms,
n_fft=n_fft,
max_freq=max_freq,
target_sample_rate=target_sample_rate,
use_dB_normalization=use_dB_normalization,
target_dB=target_dB,
dither=dither)
self._text_featurizer = TextFeaturizer(unit_type, vocab_filepath,
                                       spm_model_prefix)
def featurize(self, speech_segment, keep_transcription_text):
"""Extract features for speech segment.
@ -79,24 +87,29 @@ class SpeechFeaturizer(object):
2. For transcript parts, keep the original text or convert text string
to a list of token indices in char-level.
Args:
    speech_segment (SpeechSegment): Speech segment to extract features from.
    keep_transcription_text (bool): True, keep transcript text, False, token ids
Returns:
    tuple: 1) spectrogram audio feature in 2darray, 2) list of token indices.
"""
spec_feature = self._audio_featurizer.featurize(speech_segment)
if keep_transcription_text:
    return spec_feature, speech_segment.transcript
if speech_segment.has_token:
    text_ids = speech_segment.token_ids
else:
    text_ids = self._text_featurizer.featurize(
        speech_segment.transcript)
return spec_feature, text_ids
@property
def vocab_size(self):
"""Return the vocabulary size.
Returns:
    int: Vocabulary size.
"""
return self._text_featurizer.vocab_size
@ -104,16 +117,43 @@ class SpeechFeaturizer(object):
def vocab_list(self):
"""Return the vocabulary in list.
Returns:
    List[str]: Vocabulary in list.
"""
return self._text_featurizer.vocab_list
@property
def vocab_dict(self):
"""Return the vocabulary in dict.
Returns:
Dict[str, int]:
"""
return self._text_featurizer.vocab_dict
@property
def feature_size(self):
"""Return the audio feature size.
Returns:
    int: audio feature size.
"""
return self._audio_featurizer.feature_size
@property
def stride_ms(self):
"""time length in `ms` unit per frame
Returns:
float: time(ms)/frame
"""
return self._audio_featurizer.stride_ms
@property
def text_feature(self):
"""Return the text feature object.
Returns:
TextFeaturizer: object.
"""
return self._text_featurizer
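A small sanity check (not from the diff) of the stride_ms property: with a 10 ms frame shift, a 3.2 s utterance yields about 320 feature frames before any subsampling:

    frames = int(3.2 * 1000 / 10.0)  # duration_ms / stride_ms
    assert frames == 320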

@ -12,44 +12,91 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the text featurizer class."""
import sentencepiece as spm

from deepspeech.frontend.utility import EOS
from deepspeech.frontend.utility import UNK
class TextFeaturizer(object):
"""Text featurizer, for processing or extracting features from text.
def __init__(self, unit_type, vocab_filepath, spm_model_prefix=None):
"""Text featurizer, for processing or extracting features from text.
Currently, it only supports char-level tokenizing and conversion into
a list of token indices. Note that the token indexing order follows the
given vocabulary file.
Currently, it supports char/word/sentence-piece level tokenizing and conversion into
a list of token indices. Note that the token indexing order follows the
given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices
conversion.
:type specgram_type: str
"""
Args:
unit_type (str): unit type, e.g. char, word, spm
vocab_filepath (str): Filepath to load vocabulary for token indices conversion.
spm_model_prefix (str, optional): spm model prefix. Defaults to None.
"""
assert unit_type in ('char', 'spm', 'word')
self.unit_type = unit_type
self.unk = UNK
if vocab_filepath:
self._vocab_dict, self._id2token, self._vocab_list = self._load_vocabulary_from_file(
vocab_filepath)
self.unk_id = self._vocab_list.index(self.unk)
self.eos_id = self._vocab_list.index(EOS)
if unit_type == 'spm':
spm_model = spm_model_prefix + '.model'
self.sp = spm.SentencePieceProcessor()
self.sp.Load(spm_model)
def tokenize(self, text):
if self.unit_type == 'char':
tokens = self.char_tokenize(text)
elif self.unit_type == 'word':
tokens = self.word_tokenize(text)
else: # spm
tokens = self.spm_tokenize(text)
return tokens
def detokenize(self, tokens):
if self.unit_type == 'char':
text = self.char_detokenize(tokens)
elif self.unit_type == 'word':
text = self.word_detokenize(tokens)
else: # spm
text = self.spm_detokenize(tokens)
return text
def featurize(self, text):
"""Convert text string to a list of token indices in char-level.Note
that the token indexing order follows the given vocabulary file.
"""Convert text string to a list of token indices.
:param text: Text to process.
:type text: str
:return: List of char-level token indices.
:rtype: list
Args:
text (str): Text to process.
Returns:
List[int]: List of token indices.
"""
tokens = self._char_tokenize(text)
tokens = self.tokenize(text)
ids = []
for token in tokens:
token = token if token in self._vocab_dict else self.unk
ids.append(self._vocab_dict[token])
return ids
def defeaturize(self, idxs):
"""Convert a list of token indices to text string,
ignore index after eos_id.
Args:
idxs (List[int]): List of token indices.
Returns:
str: Decoded text string.
"""
tokens = []
for idx in idxs:
if idx == self.eos_id:
break
tokens.append(self._id2token[idx])
text = self.detokenize(tokens)
return text
@property
def vocab_size(self):
"""Return the vocabulary size.
@ -63,21 +110,110 @@ class TextFeaturizer(object):
def vocab_list(self):
"""Return the vocabulary in list.
Returns:
    List[str]: tokens.
"""
return self._vocab_list
@property
def vocab_dict(self):
"""Return the vocabulary in dict.
Returns:
Dict[str, int]: token str -> int
"""
return self._vocab_dict
def char_tokenize(self, text):
"""Character tokenizer.
Args:
text (str): text string.
Returns:
List[str]: tokens.
"""
return list(text.strip())
def char_detokenize(self, tokens):
"""Character detokenizer.
Args:
tokens (List[str]): tokens.
Returns:
str: text string.
"""
return "".join(tokens)
def word_tokenize(self, text):
"""Word tokenizer, separate by <space>."""
return text.strip().split()
def word_detokenize(self, tokens):
"""Word detokenizer, separate by <space>."""
return " ".join(tokens)
def spm_tokenize(self, text):
"""spm tokenize.
Args:
text (str): text string.
Returns:
List[str]: sentence pieces str code
"""
stats = {"num_empty": 0, "num_filtered": 0}
def valid(line):
return True
def encode(l):
return self.sp.EncodeAsPieces(l)
def encode_line(line):
line = line.strip()
if len(line) > 0:
line = encode(line)
if valid(line):
return line
else:
stats["num_filtered"] += 1
else:
stats["num_empty"] += 1
return None
enc_line = encode_line(text)
return enc_line
def spm_detokenize(self, tokens, input_format='piece'):
"""spm detokenize.
Args:
tokens (List[str]): tokens.
Returns:
str: text
"""
if input_format == "piece":
def decode(l):
return "".join(self.sp.DecodePieces(l))
elif input_format == "id":
def decode(l):
return "".join(self.sp.DecodeIds(l))
return decode(tokens)
def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file."""
vocab_lines = []
with open(vocab_filepath, 'r', encoding='utf-8') as file:
    vocab_lines.extend(file.readlines())
vocab_list = [line[:-1] for line in vocab_lines]
id2token = dict(
    [(idx, token) for (idx, token) in enumerate(vocab_list)])
token2id = dict(
    [(token, idx) for (idx, token) in enumerate(vocab_list)])
return token2id, id2token, vocab_list
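A minimal sketch (not part of the diff) of the vocab round trip built by _load_vocabulary_from_file; the vocabulary contents here are hypothetical:

    vocab_list = ['<blank>', '<unk>', 'a', 'b', ' ', '<sos/eos>']
    token2id = {t: i for i, t in enumerate(vocab_list)}
    id2token = {i: t for i, t in enumerate(vocab_list)}
    unk_id = token2id['<unk>']
    ids = [token2id.get(tok, unk_id) for tok in list('ab cab')]
    text = ''.join(id2token[i] for i in ids)
    assert text == 'ab cab'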

@ -12,11 +12,68 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains feature normalizers."""
import json

import numpy as np
import paddle
from paddle.io import DataLoader
from paddle.io import Dataset

from deepspeech.frontend.audio import AudioSegment
from deepspeech.frontend.utility import load_cmvn
from deepspeech.frontend.utility import read_manifest
from deepspeech.utils.log import Log
__all__ = ["FeatureNormalizer"]
logger = Log(__name__).getlog()
# https://github.com/PaddlePaddle/Paddle/pull/31481
class CollateFunc(object):
def __init__(self, feature_func):
self.feature_func = feature_func
def __call__(self, batch):
mean_stat = None
var_stat = None
number = 0
for item in batch:
audioseg = AudioSegment.from_file(item['feat'])
feat = self.feature_func(audioseg) #(D, T)
sums = np.sum(feat, axis=1)
if mean_stat is None:
mean_stat = sums
else:
mean_stat += sums
square_sums = np.sum(np.square(feat), axis=1)
if var_stat is None:
var_stat = square_sums
else:
var_stat += square_sums
number += feat.shape[1]
return number, mean_stat, var_stat
class AudioDataset(Dataset):
def __init__(self, manifest_path, num_samples=-1, rng=None, random_seed=0):
self._rng = rng if rng else np.random.RandomState(random_seed)
manifest = read_manifest(manifest_path)
if num_samples == -1:
sampled_manifest = manifest
else:
sampled_manifest = self._rng.choice(
manifest, num_samples, replace=False)
self.items = sampled_manifest
def __len__(self):
return len(self.items)
def __getitem__(self, idx):
return self.items[idx]
class FeatureNormalizer(object):
@ -47,27 +104,35 @@ class FeatureNormalizer(object):
manifest_path=None,
featurize_func=None,
num_samples=500,
num_workers=0,
random_seed=0):
if not mean_std_filepath:
if not (manifest_path and featurize_func):
raise ValueError("If mean_std_filepath is None, meanifest_path "
"and featurize_func should not be None.")
self._rng = np.random.RandomState(random_seed)
self._compute_mean_std(manifest_path, featurize_func, num_samples,
                       num_workers)
else:
self._read_mean_std_from_file(mean_std_filepath)
def apply(self, features):
    """Normalize features to be of zero mean and unit stddev.

    :param features: Input features to be normalized.
    :type features: ndarray, shape (D, T)
    :return: Normalized features.
    :rtype: ndarray
    """
    return (features - self._mean) * self._istd
def _read_mean_std_from_file(self, filepath, eps=1e-20):
"""Load mean and std from file."""
mean, istd = load_cmvn(filepath, filetype='json')
self._mean = np.expand_dims(mean, axis=-1)
self._istd = np.expand_dims(istd, axis=-1)
def write_to_file(self, filepath):
"""Write the mean and stddev to the file.
@ -75,23 +140,52 @@ class FeatureNormalizer(object):
:param filepath: File to write mean and stddev.
:type filepath: str
"""
with open(filepath, 'w') as fout:
    fout.write(json.dumps(self.cmvn_info))
def _compute_mean_std(self,
manifest_path,
featurize_func,
num_samples,
num_workers,
batch_size=64,
eps=1e-20):
"""Compute mean and std from randomly sampled instances."""
paddle.set_device('cpu')
collate_func = CollateFunc(featurize_func)
dataset = AudioDataset(manifest_path, num_samples, self._rng)
data_loader = DataLoader(
dataset,
batch_size=batch_size,
shuffle=False,
num_workers=num_workers,
collate_fn=collate_func)
with paddle.no_grad():
all_mean_stat = None
all_var_stat = None
all_number = 0
wav_number = 0
for i, batch in enumerate(data_loader):
number, mean_stat, var_stat = batch
if i == 0:
all_mean_stat = mean_stat
all_var_stat = var_stat
else:
all_mean_stat += mean_stat
all_var_stat += var_stat
all_number += number
wav_number += batch_size
if wav_number % 1000 == 0:
logger.info('process {} wavs,{} frames'.format(wav_number,
all_number))
self.cmvn_info = {
'mean_stat': list(all_mean_stat.tolist()),
'var_stat': list(all_var_stat.tolist()),
'frame_num': all_number,
}
return self.cmvn_info
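A worked sketch (not in the diff) of how the accumulated {mean_stat, var_stat, frame_num} turn into per-dimension mean and inverse stddev, mirroring what _load_json_cmvn later does with this file; the numbers are made up:

    import numpy as np

    cmvn_info = {'mean_stat': [100.0, -50.0],
                 'var_stat': [260.0, 75.0],
                 'frame_num': 50}
    count = cmvn_info['frame_num']
    mean = np.array(cmvn_info['mean_stat']) / count           # [2.0, -1.0]
    var = np.array(cmvn_info['var_stat']) / count - mean**2   # [1.2, 0.5]
    istd = 1.0 / np.sqrt(np.maximum(var, 1e-20))
    feat = np.random.randn(2, 10)                             # (D, T)
    normalized = (feat - mean[:, None]) * istd[:, None]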

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the speech segment class."""
import numpy as np
from deepspeech.frontend.audio import AudioSegment
@ -24,7 +24,12 @@ class SpeechSegment(AudioSegment):
AudioSegment (AudioSegment): Audio Segment
"""
def __init__(self,
             samples,
             sample_rate,
             transcript,
             tokens=None,
             token_ids=None):
"""Speech segment abstraction, a subclass of AudioSegment,
with an additional transcript.
@ -32,9 +37,14 @@ class SpeechSegment(AudioSegment):
samples (ndarray.float32): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Transcript tokens for the speech.
token_ids (List[int], optional): Transcript token ids for the speech.
"""
AudioSegment.__init__(self, samples, sample_rate)
self._transcript = transcript
# must init `tokens` with `token_ids` at the same time
self._tokens = tokens
self._token_ids = token_ids
def __eq__(self, other):
"""Return whether two objects are equal.
@ -46,6 +56,11 @@ class SpeechSegment(AudioSegment):
return False
if self._transcript != other._transcript:
return False
if self.has_token and other.has_token:
if self._tokens != other._tokens:
return False
if self._token_ids != other._token_ids:
return False
return True
def __ne__(self, other):
@ -53,33 +68,39 @@ class SpeechSegment(AudioSegment):
return not self.__eq__(other)
@classmethod
def from_file(cls, filepath, transcript, tokens=None, token_ids=None):
    """Create speech segment from audio file and corresponding transcript.

    Args:
        filepath (str|file): Filepath or file object to audio file.
        transcript (str): Transcript text for the speech.
        tokens (List[str], optional): text tokens. Defaults to None.
        token_ids (List[int], optional): text token ids. Defaults to None.

    Returns:
        SpeechSegment: Speech segment instance.
    """
    audio = AudioSegment.from_file(filepath)
    return cls(audio.samples, audio.sample_rate, transcript, tokens,
               token_ids)
@classmethod
def from_bytes(cls, bytes, transcript, tokens=None, token_ids=None):
    """Create speech segment from a byte string and corresponding
    transcript.

    Args:
        bytes (bytes): Byte string containing audio samples.
        transcript (str): Transcript text for the speech.
        tokens (List[str], optional): text tokens. Defaults to None.
        token_ids (List[int], optional): text token ids. Defaults to None.

    Returns:
        SpeechSegment: Speech segment instance.
    """
    audio = AudioSegment.from_bytes(bytes)
    return cls(audio.samples, audio.sample_rate, transcript, tokens,
               token_ids)
@classmethod
def concatenate(cls, *segments):
@ -98,6 +119,8 @@ class SpeechSegment(AudioSegment):
raise ValueError("No speech segments are given to concatenate.")
sample_rate = segments[0]._sample_rate
transcripts = ""
tokens = []
token_ids = []
for seg in segments:
if sample_rate != seg._sample_rate:
raise ValueError("Can't concatenate segments with "
@ -106,11 +129,20 @@ class SpeechSegment(AudioSegment):
raise TypeError("Only speech segments of the same type "
"instance can be concatenated.")
transcripts += seg._transcript
if seg.has_token:
tokens += seg._tokens
token_ids += seg._token_ids
samples = np.concatenate([seg.samples for seg in segments])
return cls(samples, sample_rate, transcripts, tokens, token_ids)
@classmethod
def slice_from_file(cls,
                    filepath,
                    transcript,
                    tokens=None,
                    token_ids=None,
                    start=None,
                    end=None):
"""Loads a small section of an speech without having to load
the entire file into the memory which can be incredibly wasteful.
@ -132,28 +164,54 @@ class SpeechSegment(AudioSegment):
:rtype: SpeechSegment
"""
audio = AudioSegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript, tokens,
           token_ids)
@classmethod
def make_silence(cls, duration, sample_rate):
"""Creates a silent speech segment of the given duration and
sample rate, transcript will be an empty string.
Args:
    duration (float): Length of silence in seconds.
    sample_rate (float): Sample rate.

Returns:
    SpeechSegment: Silence of the given duration.
"""
audio = AudioSegment.make_silence(duration, sample_rate)
return cls(audio.samples, audio.sample_rate, "")
@property
def has_token(self):
if self._tokens and self._token_ids:
return True
return False
@property
def transcript(self):
"""Return the transcript text.
Returns:
    str: Transcript text for the speech.
"""
return self._transcript
@property
def tokens(self):
"""Return the transcript text tokens.
Returns:
List[str]: text tokens.
"""
return self._tokens
@property
def token_ids(self):
"""Return the transcript text token ids.
Returns:
List[int]: text token ids.
"""
return self._token_ids
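A hypothetical usage sketch (not in the diff), assuming the class above is importable; tokens and token_ids must be supplied together for has_token to hold:

    import numpy as np

    samples = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
    seg = SpeechSegment(samples, 16000, "hi", tokens=['h', 'i'],
                        token_ids=[4, 5])
    assert seg.has_token and seg.token_ids == [4, 5]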

@ -12,41 +12,248 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains data helper functions."""
import codecs
import json
import math
import sys

import numpy as np

from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = [
"load_cmvn", "read_manifest", "rms_to_db", "rms_to_dbfs", "max_dbfs",
"mean_dbfs", "gain_db_to_ratio", "normalize_audio", "SOS", "EOS", "UNK",
"BLANK"
]
IGNORE_ID = -1
SOS = "<sos/eos>"
EOS = SOS
UNK = "<unk>"
BLANK = "<blank>"
def read_manifest(
manifest_path,
max_input_len=float('inf'),
min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0, ):
"""Load and parse manifest file.
Instances with durations outside [min_duration, max_duration] will be
filtered out.
Args:
manifest_path ([type]): Manifest file to load and parse.
max_input_len ([type], optional): maximum output seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to float('inf').
min_input_len (float, optional): minimum input seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to 0.0.
max_output_len (float, optional): maximum input seq length, in modeling units. Defaults to 500.0.
min_output_len (float, optional): minimum input seq length, in modeling units. Defaults to 0.0.
max_output_input_ratio (float, optional): maximum output seq length/output seq length ratio. Defaults to 10.0.
min_output_input_ratio (float, optional): minimum output seq length/output seq length ratio. Defaults to 0.05.
Raises:
IOError: If failed to parse the manifest.
:param manifest_path: Manifest file to load and parse.
:type manifest_path: str
:param max_duration: Maximal duration in seconds for instance filter.
:type max_duration: float
:param min_duration: Minimal duration in seconds for instance filter.
:type min_duration: float
:return: Manifest parsing results. List of dict.
:rtype: list
:raises IOError: If failed to parse the manifest.
Returns:
List[dict]: Manifest parsing results.
"""
manifest = []
for json_line in codecs.open(manifest_path, 'r', 'utf-8'):
try:
json_data = json.loads(json_line)
except Exception as e:
raise IOError("Error reading manifest: %s" % str(e))
if (json_data["duration"] <= max_duration and
json_data["duration"] >= min_duration):
feat_len = json_data["feat_shape"][
0] if 'feat_shape' in json_data else 1.0
token_len = json_data["token_shape"][
0] if 'token_shape' in json_data else 1.0
conditions = [
feat_len >= min_input_len,
feat_len <= max_input_len,
token_len >= min_output_len,
token_len <= max_output_len,
token_len / feat_len >= min_output_input_ratio,
token_len / feat_len <= max_output_input_ratio,
]
if all(conditions):
manifest.append(json_data)
return manifest
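A worked example (hypothetical numbers) of the filter above: an entry with feat_shape=[980, 80] and token_shape=[21] has feat_len=980 frames and token_len=21 modeling units, so it is kept iff both lengths fall inside the input/output ranges and 21/980 falls inside the ratio bounds:

    json_data = {"feat_shape": [980, 80], "token_shape": [21]}
    feat_len = json_data["feat_shape"][0]    # 980 frames
    token_len = json_data["token_shape"][0]  # 21 modeling units
    ratio = token_len / feat_len             # ~0.021, checked against the ratio bounds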
def rms_to_db(rms: float):
"""Root Mean Square to dB.
Args:
rms ([float]): root mean square
Returns:
float: dB
"""
return 20.0 * math.log10(max(1e-16, rms))
def rms_to_dbfs(rms: float):
"""Root Mean Square to dBFS.
https://fireattack.wordpress.com/2017/02/06/replaygain-loudness-normalization-and-applications/
Audio is mix of sine wave, so 1 amp sine wave's Full scale is 0.7071, equal to -3.0103dB.
dB = dBFS + 3.0103
dBFS = db - 3.0103
e.g. 0 dB = -3.0103 dBFS
Args:
rms ([float]): root mean square
Returns:
float: dBFS
"""
return rms_to_db(rms) - 3.0103
def max_dbfs(sample_data: np.ndarray):
"""Peak dBFS based on the maximum energy sample.
Args:
sample_data ([np.ndarray]): float array, [-1, 1].
Returns:
float: dBFS
"""
# Peak dBFS based on the maximum energy sample. Will prevent overdrive if used for normalization.
return rms_to_dbfs(max(abs(np.min(sample_data)), abs(np.max(sample_data))))
def mean_dbfs(sample_data):
"""Peak dBFS based on the RMS energy.
Args:
sample_data ([np.ndarray]): float array, [-1, 1].
Returns:
float: dBFS
"""
return rms_to_dbfs(
math.sqrt(np.mean(np.square(sample_data, dtype=np.float64))))
def gain_db_to_ratio(gain_db: float):
"""dB to ratio
Args:
gain_db (float): gain in dB
Returns:
float: scale in amp
"""
return math.pow(10.0, gain_db / 20.0)
def normalize_audio(sample_data: np.ndarray, dbfs: float=-3.0103):
"""Nomalize audio to dBFS.
Args:
sample_data (np.ndarray): input wave samples, [-1, 1].
dbfs (float, optional): target dBFS. Defaults to -3.0103.
Returns:
np.ndarray: normalized wave
"""
return np.maximum(
np.minimum(sample_data * gain_db_to_ratio(dbfs - max_dbfs(sample_data)),
1.0), -1.0)
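A quick sketch (not in the diff) of the helpers above: a 0.1-amplitude sine peaks at about -23 dBFS, so normalize_audio scales it by 10x to hit the default -3.0103 dBFS target:

    import numpy as np

    t = np.arange(16000) / 16000.0
    quiet = 0.1 * np.sin(2 * np.pi * 440.0 * t)
    louder = normalize_audio(quiet)  # peak is now ~1.0
    # max_dbfs(louder) is now ~ -3.0103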
def _load_json_cmvn(json_cmvn_file):
""" Load the json format cmvn stats file and calculate cmvn
Args:
json_cmvn_file: cmvn stats file in json format
Returns:
a numpy array of [means, istd] (variances are converted to inverse stddev)
"""
with open(json_cmvn_file) as f:
cmvn_stats = json.load(f)
means = cmvn_stats['mean_stat']
variance = cmvn_stats['var_stat']
count = cmvn_stats['frame_num']
for i in range(len(means)):
means[i] /= count
variance[i] = variance[i] / count - means[i] * means[i]
if variance[i] < 1.0e-20:
variance[i] = 1.0e-20
variance[i] = 1.0 / math.sqrt(variance[i])
cmvn = np.array([means, variance])
return cmvn
def _load_kaldi_cmvn(kaldi_cmvn_file):
""" Load the kaldi format cmvn stats file and calculate cmvn
Args:
kaldi_cmvn_file: kaldi text style global cmvn file, which
is generated by:
compute-cmvn-stats --binary=false scp:feats.scp global_cmvn
Returns:
a numpy array of [means, istd] (variances are converted to inverse stddev)
"""
means = []
variance = []
with open(kaldi_cmvn_file, 'r') as fid:
# kaldi binary file start with '\0B'
if fid.read(2) == '\0B':
logger.error('kaldi cmvn binary file is not supported, please '
'recompute it by: compute-cmvn-stats --binary=false '
' scp:feats.scp global_cmvn')
sys.exit(1)
fid.seek(0)
arr = fid.read().split()
assert (arr[0] == '[')
assert (arr[-2] == '0')
assert (arr[-1] == ']')
feat_dim = int((len(arr) - 2 - 2) / 2)
for i in range(1, feat_dim + 1):
means.append(float(arr[i]))
count = float(arr[feat_dim + 1])
for i in range(feat_dim + 2, 2 * feat_dim + 2):
variance.append(float(arr[i]))
for i in range(len(means)):
means[i] /= count
variance[i] = variance[i] / count - means[i] * means[i]
if variance[i] < 1.0e-20:
variance[i] = 1.0e-20
variance[i] = 1.0 / math.sqrt(variance[i])
cmvn = np.array([means, variance])
return cmvn
def load_cmvn(cmvn_file: str, filetype: str):
"""load cmvn from file.
Args:
cmvn_file (str): cmvn path.
filetype (str): file type, optional[json, kaldi].
Raises:
ValueError: file type not support.
Returns:
Tuple[np.ndarray, np.ndarray]: mean, istd
"""
assert filetype in ['json', 'kaldi'], filetype
filetype = filetype.lower()
if filetype == "json":
cmvn = _load_json_cmvn(cmvn_file)
elif filetype == "kaldi":
cmvn = _load_kaldi_cmvn(cmvn_file)
else:
raise ValueError(f"cmvn file type no support: {filetype}")
return cmvn[0], cmvn[1]

@ -11,25 +11,33 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from paddle.io import DataLoader
from deepspeech.io.collator import SpeechCollator
from deepspeech.io.dataset import ManifestDataset
from deepspeech.io.sampler import SortagradBatchSampler
from deepspeech.io.sampler import SortagradDistributedBatchSampler
def create_dataloader(manifest_path,
unit_type,
vocab_filepath,
mean_std_filepath,
spm_model_prefix,
augmentation_config='{}',
max_input_len=float('inf'),
min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
feat_dim=None,
delta_delta=False,
use_dB_normalization=True,
random_seed=0,
keep_transcription_text=False,
@ -41,16 +49,24 @@ def create_dataloader(manifest_path,
dist=False):
dataset = ManifestDataset(
manifest_path=manifest_path,
unit_type=unit_type,
vocab_filepath=vocab_filepath,
mean_std_filepath=mean_std_filepath,
spm_model_prefix=spm_model_prefix,
augmentation_config=augmentation_config,
max_input_len=max_input_len,
min_input_len=min_input_len,
max_output_len=max_output_len,
min_output_len=min_output_len,
max_output_input_ratio=max_output_input_ratio,
min_output_input_ratio=min_output_input_ratio,
stride_ms=stride_ms,
window_ms=window_ms,
max_freq=max_freq,
specgram_type=specgram_type,
feat_dim=feat_dim,
delta_delta=delta_delta,
use_dB_normalization=use_dB_normalization,
random_seed=random_seed,
keep_transcription_text=keep_transcription_text)
@ -74,7 +90,10 @@ def create_dataloader(manifest_path,
sortagrad=is_training,
shuffle_method=shuffle_method)
def padding_batch(batch,
                  padding_to=-1,
                  flatten=False,
                  keep_transcription_text=True):
"""
Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one bach.
@ -107,10 +126,10 @@ def create_dataloader(manifest_path,
audio_lens.append(audio.shape[1])
padded_text = np.zeros([max_text_length])
if keep_transcription_text:
padded_text[:len(text)] = [ord(t) for t in text] # string
else:
padded_text[:len(text)] = text # ids
texts.append(padded_text)
text_lens.append(len(text))
@ -118,11 +137,13 @@ def create_dataloader(manifest_path,
audio_lens = np.array(audio_lens).astype('int64')
texts = np.array(texts).astype('int32')
text_lens = np.array(text_lens).astype('int64')
return padded_audios, audio_lens, texts, text_lens
# collate_fn=functools.partial(padding_batch, keep_transcription_text=keep_transcription_text),
collate_fn = SpeechCollator(keep_transcription_text=keep_transcription_text)
loader = DataLoader(
dataset,
batch_sampler=batch_sampler,
collate_fn=collate_fn,
num_workers=num_workers)
return loader

@ -11,63 +11,68 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np

from deepspeech.frontend.utility import IGNORE_ID
from deepspeech.io.utility import pad_sequence
from deepspeech.utils.log import Log

__all__ = [
    "SpeechCollator",
]
logger = Log(__name__).getlog()
class SpeechCollator():
def __init__(self, keep_transcription_text=True):
    """
    if ``keep_transcription_text`` is False, text is token ids else is raw string.
    """
    self._keep_transcription_text = keep_transcription_text
def __call__(self, batch):
    """batch examples

    Args:
        batch ([List]): batch is (audio, text)
            audio (np.ndarray) shape (D, T)
            text (List[int] or str): shape (U,)

    Returns:
        tuple(audio, audio_lens, text, text_lens): batched data.
            audio : (B, Tmax, D)
            audio_lens: (B)
            text : (B, Umax)
            text_lens: (B)
    """
    audios = []
    audio_lens = []
    texts = []
    text_lens = []
for audio, text in batch:
    # audio
    audios.append(audio.T)  # [T, D]
    audio_lens.append(audio.shape[1])
    # text
    # for training, text is token ids
    # else text is string, convert to unicode ord
    tokens = []
    if self._keep_transcription_text:
        assert isinstance(text, str), (type(text), text)
        tokens = [ord(t) for t in text]
    else:
        tokens = text  # token ids
    tokens = tokens if isinstance(tokens, np.ndarray) else np.array(
        tokens, dtype=np.int64)
    texts.append(tokens)
    text_lens.append(tokens.shape[0])
padded_audios = pad_sequence(
audios, padding_value=0.0).astype(np.float32) #[B, T, D]
audio_lens = np.array(audio_lens).astype(np.int64)
padded_texts = pad_sequence(
texts, padding_value=IGNORE_ID).astype(np.int64)
text_lens = np.array(text_lens).astype(np.int64)
return padded_audios, audio_lens, padded_texts, text_lens
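A shape sketch (not in the diff) of what the collator yields for a 2-example batch, assuming pad_sequence from deepspeech.io.utility (defined later in this diff):

    import numpy as np

    a = (np.random.randn(80, 30), np.array([3, 7, 2]))  # (D, T) audio, 3 token ids
    b = (np.random.randn(80, 50), np.array([5]))
    collate = SpeechCollator(keep_transcription_text=False)
    audio, audio_lens, text, text_lens = collate([a, b])
    # audio: (2, 50, 80); audio_lens: [30, 50]
    # text: (2, 3), padded with IGNORE_ID = -1; text_lens: [3, 1]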

@ -11,44 +11,151 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import io
import tarfile
import time
from collections import namedtuple
from typing import Optional

import numpy as np
from paddle.io import Dataset
from yacs.config import CfgNode

from deepspeech.frontend.augmentor.augmentation import AugmentationPipeline
from deepspeech.frontend.featurizer.speech_featurizer import SpeechFeaturizer
from deepspeech.frontend.normalizer import FeatureNormalizer
from deepspeech.frontend.speech import SpeechSegment
from deepspeech.frontend.utility import read_manifest
from deepspeech.utils.log import Log
__all__ = [
"ManifestDataset",
]
logger = Log(__name__).getlog()
# namedtuple needs to be global for pickle.
TarLocalData = namedtuple('TarLocalData', ['tar2info', 'tar2object'])
class ManifestDataset(Dataset):
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
default = CfgNode(
dict(
train_manifest="",
dev_manifest="",
test_manifest="",
manifest="",
unit_type="char",
vocab_filepath="",
spm_model_prefix="",
mean_std_filepath="",
augmentation_config="",
max_input_len=27.0,
min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0,
stride_ms=10.0, # ms
window_ms=20.0, # ms
n_fft=None, # fft points
max_freq=None, # None for samplerate/2
raw_wav=True, # use raw_wav or kaldi feature
specgram_type='linear', # 'linear', 'mfcc', 'fbank'
feat_dim=0, # 'mfcc', 'fbank'
delta_delta=False, # 'mfcc', 'fbank'
dither=1.0, # feature dither
target_sample_rate=16000, # target sample rate
use_dB_normalization=True,
target_dB=-20,
random_seed=0,
keep_transcription_text=False,
batch_size=32, # batch size
num_workers=0, # data loader workers
sortagrad=False, # sorted in first epoch when True
shuffle_method="batch_shuffle", # 'batch_shuffle', 'instance_shuffle'
))
if config is not None:
config.merge_from_other_cfg(default)
return default
@classmethod
def from_config(cls, config):
"""Build a ManifestDataset object from a config.
Args:
config (yacs.config.CfgNode): configs object.
Returns:
ManifestDataset: dataset object.
"""
assert 'manifest' in config.data
assert config.data.manifest
assert 'keep_transcription_text' in config.data
if isinstance(config.data.augmentation_config, (str, bytes)):
if config.data.augmentation_config:
aug_file = io.open(
config.data.augmentation_config, mode='r', encoding='utf8')
else:
aug_file = io.StringIO(initial_value='{}', newline='')
else:
aug_file = config.data.augmentation_config
assert isinstance(aug_file, io.StringIO)
dataset = cls(
manifest_path=config.data.manifest,
unit_type=config.data.unit_type,
vocab_filepath=config.data.vocab_filepath,
mean_std_filepath=config.data.mean_std_filepath,
spm_model_prefix=config.data.spm_model_prefix,
augmentation_config=aug_file.read(),
max_input_len=config.data.max_input_len,
min_input_len=config.data.min_input_len,
max_output_len=config.data.max_output_len,
min_output_len=config.data.min_output_len,
max_output_input_ratio=config.data.max_output_input_ratio,
min_output_input_ratio=config.data.min_output_input_ratio,
stride_ms=config.data.stride_ms,
window_ms=config.data.window_ms,
n_fft=config.data.n_fft,
max_freq=config.data.max_freq,
target_sample_rate=config.data.target_sample_rate,
specgram_type=config.data.specgram_type,
feat_dim=config.data.feat_dim,
delta_delta=config.data.delta_delta,
dither=config.data.dither,
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=config.data.keep_transcription_text)
return dataset
def __init__(self,
manifest_path,
unit_type,
vocab_filepath,
mean_std_filepath,
spm_model_prefix=None,
augmentation_config='{}',
max_input_len=float('inf'),
min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0,
stride_ms=10.0,
window_ms=20.0,
n_fft=None,
max_freq=None,
target_sample_rate=16000,
specgram_type='linear',
feat_dim=None,
delta_delta=False,
dither=1.0,
use_dB_normalization=True,
target_dB=-20,
random_seed=0,
@ -57,52 +164,69 @@ class ManifestDataset(Dataset):
Args:
manifest_path (str): manifest json file path
unit_type(str): token unit type, e.g. char, word, spm
vocab_filepath (str): vocab file path.
mean_std_filepath (str): mean and std file path, which suffix is *.npy
spm_model_prefix (str): spm model prefix, need if `unit_type` is spm.
augmentation_config (str, optional): augmentation json str. Defaults to '{}'.
max_input_len (float, optional): maximum input seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to float('inf').
min_input_len (float, optional): minimum input seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to 0.0.
max_output_len (float, optional): maximum output seq length, in modeling units. Defaults to float('inf').
min_output_len (float, optional): minimum output seq length, in modeling units. Defaults to 0.0.
max_output_input_ratio (float, optional): maximum output seq length / input seq length ratio. Defaults to float('inf').
min_output_input_ratio (float, optional): minimum output seq length / input seq length ratio. Defaults to 0.0.
stride_ms (float, optional): stride size in ms. Defaults to 10.0.
window_ms (float, optional): window size in ms. Defaults to 20.0.
n_fft (int, optional): fft points for rfft. Defaults to None.
max_freq (int, optional): max cut freq. Defaults to None.
target_sample_rate (int, optional): target sample rate which used for training. Defaults to 16000.
specgram_type (str, optional): 'linear', 'mfcc' or 'fbank'. Defaults to 'linear'.
feat_dim (int, optional): audio feature dim, using by 'mfcc' or 'fbank'. Defaults to None.
delta_delta (bool, optional): audio feature with delta-delta, using by 'fbank' or 'mfcc'. Defaults to False.
dither (float, optional): feature dither amount. Defaults to 1.0.
use_dB_normalization (bool, optional): do dB normalization. Defaults to True.
target_dB (int, optional): target dB. Defaults to -20.
random_seed (int, optional): for random generator. Defaults to 0.
keep_transcription_text (bool, optional): True, when not in training mode, will not do tokenizer; Defaults to False.
"""
super().__init__()
self._stride_ms = stride_ms
self._target_sample_rate = target_sample_rate
self._normalizer = FeatureNormalizer(
mean_std_filepath) if mean_std_filepath else None
self._augmentation_pipeline = AugmentationPipeline(
augmentation_config=augmentation_config, random_seed=random_seed)
self._speech_featurizer = SpeechFeaturizer(
unit_type=unit_type,
vocab_filepath=vocab_filepath,
spm_model_prefix=spm_model_prefix,
specgram_type=specgram_type,
feat_dim=feat_dim,
delta_delta=delta_delta,
stride_ms=stride_ms,
window_ms=window_ms,
n_fft=n_fft,
max_freq=max_freq,
target_sample_rate=target_sample_rate,
use_dB_normalization=use_dB_normalization,
target_dB=target_dB,
dither=dither)
self._rng = np.random.RandomState(random_seed)
self._keep_transcription_text = keep_transcription_text
# for caching tar files info
self._local_data = TarLocalData(tar2info={}, tar2object={})
# read manifest
self._manifest = read_manifest(
manifest_path=manifest_path,
max_input_len=max_input_len,
min_input_len=min_input_len,
max_output_len=max_output_len,
min_output_len=min_output_len,
max_output_input_ratio=max_output_input_ratio,
min_output_input_ratio=min_output_input_ratio)
self._manifest.sort(key=lambda x: x["feat_shape"][0])
@property
def manifest(self):
@ -110,26 +234,28 @@ class ManifestDataset(Dataset):
@property
def vocab_size(self):
"""Return the vocabulary size.
:return: Vocabulary size.
:rtype: int
"""
return self._speech_featurizer.vocab_size
@property
def vocab_list(self):
"""Return the vocabulary in list.
:return: Vocabulary in list.
:rtype: list
"""
return self._speech_featurizer.vocab_list
@property
def vocab_dict(self):
return self._speech_featurizer.vocab_dict
@property
def text_feature(self):
return self._speech_featurizer.text_feature
@property
def feature_size(self):
return self._speech_featurizer.feature_size
@property
def stride_ms(self):
return self._speech_featurizer.stride_ms
def _parse_tar(self, file):
"""Parse a tar file to get a tarfile object
and a map containing tarinfoes
@ -169,15 +295,34 @@ class ManifestDataset(Dataset):
where transcription part could be token ids or text.
:rtype: tuple of (2darray, list)
"""
start_time = time.time()
if isinstance(audio_file, str) and audio_file.startswith('tar:'):
speech_segment = SpeechSegment.from_file(
self._subfile_from_tar(audio_file), transcript)
else:
speech_segment = SpeechSegment.from_file(audio_file, transcript)
load_wav_time = time.time() - start_time
#logger.debug(f"load wav time: {load_wav_time}")
# audio augment
start_time = time.time()
self._augmentation_pipeline.transform_audio(speech_segment)
audio_aug_time = time.time() - start_time
#logger.debug(f"audio augmentation time: {audio_aug_time}")
start_time = time.time()
specgram, transcript_part = self._speech_featurizer.featurize(
speech_segment, self._keep_transcription_text)
specgram = self._normalizer.apply(specgram)
if self._normalizer:
specgram = self._normalizer.apply(specgram)
feature_time = time.time() - start_time
#logger.debug(f"audio & test feature time: {feature_time}")
# specgram augment
start_time = time.time()
specgram = self._augmentation_pipeline.transform_feature(specgram)
feature_aug_time = time.time() - start_time
#logger.debug(f"audio feature augmentation time: {feature_aug_time}")
return specgram, transcript_part
def _instance_reader_creator(self, manifest):
@ -191,7 +336,7 @@ class ManifestDataset(Dataset):
def reader():
for instance in manifest:
inst = self.process_utterance(instance["audio_filepath"],
inst = self.process_utterance(instance["feat"],
instance["text"])
yield inst
@ -202,5 +347,4 @@ class ManifestDataset(Dataset):
def __getitem__(self, idx):
instance = self._manifest[idx]
return self.process_utterance(instance["audio_filepath"],
instance["text"])
return self.process_utterance(instance["feat"], instance["text"])

@ -11,27 +11,22 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math

import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import BatchSampler
from paddle.io import DistributedBatchSampler

from deepspeech.utils.log import Log
__all__ = [
"SortagradDistributedBatchSampler",
"SortagradBatchSampler",
]
logger = Log(__name__).getlog()
def _batch_shuffle(indices, batch_size, epoch, clipped=False):
"""Put similarly-sized instances into minibatches for better efficiency
@ -59,7 +54,7 @@ def _batch_shuffle(indices, batch_size, epoch, clipped=False):
batch_indices = list(zip(* [iter(indices[shift_len:])] * batch_size))
rng.shuffle(batch_indices)
batch_indices = [item for batch in batch_indices for item in batch]
assert clipped is False
if not clipped:
res_len = len(indices) - shift_len - len(batch_indices)
# when res_len is 0, will return whole list, len(List[-0:]) = len(List[:])
@ -161,7 +156,7 @@ class SortagradDistributedBatchSampler(DistributedBatchSampler):
for idx in _sample_iter:
batch_indices.append(idx)
if len(batch_indices) == self.batch_size:
logger.debug(
f"rank: {dist.get_rank()} batch index: {batch_indices} ")
yield batch_indices
batch_indices = []
@ -195,13 +190,13 @@ class SortagradBatchSampler(BatchSampler):
self.dataset = dataset
assert isinstance(batch_size, int) and batch_size > 0, \
"batch_size should be a positive integer"
"batch_size should be a positive integer"
self.batch_size = batch_size
assert isinstance(shuffle, bool), \
"shuffle should be a boolean value"
"shuffle should be a boolean value"
self.shuffle = shuffle
assert isinstance(drop_last, bool), \
"drop_last should be a boolean number"
"drop_last should be a boolean number"
self.drop_last = drop_last
self.epoch = 0
@ -241,7 +236,7 @@ class SortagradBatchSampler(BatchSampler):
for idx in _sample_iter:
batch_indices.append(idx)
if len(batch_indices) == self.batch_size:
logger.debug(
f"rank: {dist.get_rank()} batch index: {batch_indices} ")
yield batch_indices
batch_indices = []
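A toy sketch (not in the diff) of the batch-shuffle idea used by _batch_shuffle above: duration-sorted indices are grouped into minibatches after a small random shift, the minibatch order is shuffled, and leftover indices are re-appended:

    import numpy as np

    indices = list(range(10))  # already duration-sorted
    batch_size, epoch = 3, 0
    rng = np.random.RandomState(epoch)
    shift = rng.randint(0, batch_size)
    batches = list(zip(*[iter(indices[shift:])] * batch_size))
    rng.shuffle(batches)
    flat = [i for b in batches for i in b]
    res = len(indices) - shift - len(flat)
    flat = indices[:shift] + flat + (indices[-res:] if res > 0 else [])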

@ -0,0 +1,82 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
import numpy as np
from deepspeech.utils.log import Log
__all__ = ["pad_sequence"]
logger = Log(__name__).getlog()
def pad_sequence(sequences: List[np.ndarray],
batch_first: bool=True,
padding_value: float=0.0) -> np.ndarray:
r"""Pad a list of variable length Tensors with ``padding_value``
``pad_sequence`` stacks a list of Tensors along a new dimension,
and pads them to equal length. For example, if the input is a list of
sequences with size ``L x *``, the output has size ``T x B x *`` if
batch_first is False, and ``B x T x *`` otherwise.
`B` is batch size. It is equal to the number of elements in ``sequences``.
`T` is length of the longest sequence.
`L` is length of the sequence.
`*` is any number of trailing dimensions, including none.
Example:
>>> a = np.ones([25, 300])
>>> b = np.ones([22, 300])
>>> c = np.ones([15, 300])
>>> pad_sequence([a, b, c]).shape
[3, 25, 300]
Note:
This function returns a np.ndarray of size ``T x B x *`` or ``B x T x *``
where `T` is the length of the longest sequence. This function assumes
trailing dimensions and type of all the Tensors in sequences are same.
Args:
sequences (list[np.ndarray]): list of variable length sequences.
batch_first (bool, optional): output will be in ``B x T x *`` if True, or in
``T x B x *`` otherwise
padding_value (float, optional): value for padded elements. Default: 0.
Returns:
np.ndarray of size ``T x B x *`` if :attr:`batch_first` is ``False``.
np.ndarray of size ``B x T x *`` otherwise
"""
# assuming trailing dimensions and type of all the Tensors
# in sequences are same and fetching those from sequences[0]
max_size = sequences[0].shape
trailing_dims = max_size[1:]
max_len = max([s.shape[0] for s in sequences])
if batch_first:
out_dims = (len(sequences), max_len) + trailing_dims
else:
out_dims = (max_len, len(sequences)) + trailing_dims
out_tensor = np.full(out_dims, padding_value, dtype=sequences[0].dtype)
for i, tensor in enumerate(sequences):
length = tensor.shape[0]
# use index notation to prevent duplicate references to the tensor
if batch_first:
out_tensor[i, :length, ...] = tensor
else:
out_tensor[:length, i, ...] = tensor
return out_tensor
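A quick check (not in the diff) of pad_sequence with the default batch_first=True:

    import numpy as np

    a, b, c = np.ones([25, 300]), np.ones([22, 300]), np.ones([15, 300])
    out = pad_sequence([a, b, c])
    assert out.shape == (3, 25, 300)
    assert (out[2, 15:] == 0.0).all()  # c is zero-padded beyond its length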

@ -11,29 +11,21 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Deepspeech2 ASR Model"""
from typing import Optional

import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
from yacs.config import CfgNode

from deepspeech.modules.activation import brelu
from deepspeech.modules.conv import ConvStack
from deepspeech.modules.ctc import CTCDecoder
from deepspeech.modules.rnn import RNNStack
from deepspeech.utils import checkpoint
from deepspeech.utils import layer_tools
from deepspeech.utils.log import Log

logger = Log(__name__).getlog()
__all__ = ['DeepSpeech2Model']
@ -67,23 +59,19 @@ class CRNNEncoder(nn.Layer):
return self.rnn_size * 2
def forward(self, audio, audio_len):
"""
audio: shape [B, D, T]
text: shape [B, T]
audio_len: shape [B]
text_len: shape [B]
"""
"""Compute Encoder outputs
Args:
audio (Tensor): [B, D, T]
text (Tensor): [B, T]
audio (Tensor): [B, Tmax, D]
text (Tensor): [B, Umax]
audio_len (Tensor): [B]
text_len (Tensor): [B]
Returns:
x (Tensor): encoder outputs, [B, T, D]
x_lens (Tensor): encoder length, [B]
"""
# [B, T, D] -> [B, D, T]
audio = audio.transpose([0, 2, 1])
# [B, D, T] -> [B, C=1, D, T]
x = audio.unsqueeze(1)
x_lens = audio_len
@ -166,26 +154,25 @@ class DeepSpeech2Model(nn.Layer):
assert (self.encoder.output_size == rnn_size * 2)
self.decoder = CTCDecoder(
odim=dict_size, # <blank> is in vocab
enc_n_units=self.encoder.output_size,
odim=dict_size + 1, # <blank> is append after vocab
blank_id=dict_size, # last token is <blank>
blank_id=0, # first token is <blank>
dropout_rate=0.0,
reduction=True, # sum
batch_average=True) # sum / batch_size
def forward(self, audio, audio_len, text, text_len):
    """Compute Model loss

    Args:
        audio (Tensor): [B, T, D]
        audio_len (Tensor): [B]
        text (Tensor): [B, U]
        text_len (Tensor): [B]

    Returns:
        loss (Tensor): [1]
"""
eouts, eouts_len = self.encoder(audio, audio_len)
loss = self.decoder(eouts, eouts_len, text, text_len)
return loss
@ -204,7 +191,7 @@ class DeepSpeech2Model(nn.Layer):
decoding_method=decoding_method)
eouts, eouts_len = self.encoder(audio, audio_len)
probs = self.decoder.softmax(eouts)
return self.decoder.decode_probs(
probs.numpy(), eouts_len, vocab_list, decoding_method,
lang_model_path, beam_alpha, beam_beta, beam_size, cutoff_prob,
@ -235,7 +222,9 @@ class DeepSpeech2Model(nn.Layer):
rnn_size=config.model.rnn_layer_size,
use_gru=config.model.use_gru,
share_rnn_weights=config.model.share_rnn_weights)
infos = checkpoint.load_parameters(
    model, checkpoint_path=checkpoint_path)
logger.info(f"checkpoint info: {infos}")
layer_tools.summary(model)
return model
@ -262,12 +251,12 @@ class DeepSpeech2InferModel(DeepSpeech2Model):
"""export model function
Args:
audio (Tensor): [B, T, D]
audio_len (Tensor): [B]
Returns:
probs: probs after softmax
"""
eouts, eouts_len = self.encoder(audio, audio_len)
probs = self.decoder.softmax(eouts)
return probs

@ -0,0 +1,928 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""U2 ASR Model
Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition
(https://arxiv.org/pdf/2012.05481.pdf)
"""
import sys
import time
from collections import defaultdict
from typing import Dict
from typing import List
from typing import Optional
from typing import Tuple
import paddle
from paddle import jit
from paddle import nn
from yacs.config import CfgNode
from deepspeech.frontend.utility import IGNORE_ID
from deepspeech.frontend.utility import load_cmvn
from deepspeech.modules.cmvn import GlobalCMVN
from deepspeech.modules.ctc import CTCDecoder
from deepspeech.modules.decoder import TransformerDecoder
from deepspeech.modules.encoder import ConformerEncoder
from deepspeech.modules.encoder import TransformerEncoder
from deepspeech.modules.loss import LabelSmoothingLoss
from deepspeech.modules.mask import make_pad_mask
from deepspeech.modules.mask import mask_finished_preds
from deepspeech.modules.mask import mask_finished_scores
from deepspeech.modules.mask import subsequent_mask
from deepspeech.utils import checkpoint
from deepspeech.utils import layer_tools
from deepspeech.utils.ctc_utils import remove_duplicates_and_blank
from deepspeech.utils.log import Log
from deepspeech.utils.tensor_utils import add_sos_eos
from deepspeech.utils.tensor_utils import pad_sequence
from deepspeech.utils.tensor_utils import th_accuracy
from deepspeech.utils.utility import log_add
__all__ = ["U2Model", "U2InferModel"]
logger = Log(__name__).getlog()
class U2BaseModel(nn.Module):
"""CTC-Attention hybrid Encoder-Decoder model"""
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
# network architecture
default = CfgNode()
# allow add new item when merge_with_file
default.cmvn_file = ""
default.cmvn_file_type = "json"
default.input_dim = 0
default.output_dim = 0
# encoder related
default.encoder = 'transformer'
default.encoder_conf = CfgNode(
dict(
output_size=256, # dimension of attention
attention_heads=4,
linear_units=2048, # the number of units of position-wise feed forward
num_blocks=12, # the number of encoder blocks
dropout_rate=0.1,
positional_dropout_rate=0.1,
attention_dropout_rate=0.0,
                input_layer='conv2d',  # encoder input type, you can choose conv2d, conv2d6 and conv2d8
normalize_before=True,
# use_cnn_module=True,
# cnn_module_kernel=15,
# activation_type='swish',
# pos_enc_layer_type='rel_pos',
# selfattention_layer_type='rel_selfattn',
))
# decoder related
default.decoder = 'transformer'
default.decoder_conf = CfgNode(
dict(
attention_heads=4,
linear_units=2048,
num_blocks=6,
dropout_rate=0.1,
positional_dropout_rate=0.1,
self_attention_dropout_rate=0.0,
src_attention_dropout_rate=0.0, ))
# hybrid CTC/attention
default.model_conf = CfgNode(
dict(
ctc_weight=0.3,
lsm_weight=0.1, # label smoothing option
length_normalized_loss=False, ))
if config is not None:
config.merge_from_other_cfg(default)
return default
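        # Typical use (hypothetical YAML path; keys mirror the defaults above):
        #     config = CfgNode()
        #     config.model = U2BaseModel.params()
        #     config.model.merge_from_file('conf/conformer.yaml')
        #     config.model.encoder = 'conformer'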
def __init__(self,
vocab_size: int,
encoder: TransformerEncoder,
decoder: TransformerDecoder,
ctc: CTCDecoder,
ctc_weight: float=0.5,
ignore_id: int=IGNORE_ID,
lsm_weight: float=0.0,
length_normalized_loss: bool=False):
assert 0.0 <= ctc_weight <= 1.0, ctc_weight
super().__init__()
# note that eos is the same as sos (equivalent ID)
self.sos = vocab_size - 1
self.eos = vocab_size - 1
self.vocab_size = vocab_size
self.ignore_id = ignore_id
self.ctc_weight = ctc_weight
self.encoder = encoder
self.decoder = decoder
self.ctc = ctc
self.criterion_att = LabelSmoothingLoss(
size=vocab_size,
padding_idx=ignore_id,
smoothing=lsm_weight,
normalize_length=length_normalized_loss, )
def forward(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
text: paddle.Tensor,
text_lengths: paddle.Tensor,
) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[
paddle.Tensor]]:
"""Frontend + Encoder + Decoder + Calc loss
Args:
speech: (Batch, Length, ...)
speech_lengths: (Batch, )
text: (Batch, Length)
text_lengths: (Batch,)
Returns:
total_loss, attention_loss, ctc_loss
"""
assert text_lengths.dim() == 1, text_lengths.shape
# Check that batch_size is unified
assert (speech.shape[0] == speech_lengths.shape[0] == text.shape[0] ==
text_lengths.shape[0]), (speech.shape, speech_lengths.shape,
text.shape, text_lengths.shape)
# 1. Encoder
start = time.time()
encoder_out, encoder_mask = self.encoder(speech, speech_lengths)
encoder_time = time.time() - start
#logger.debug(f"encoder time: {encoder_time}")
#TODO(Hui Zhang): sum not support bool type
#encoder_out_lens = encoder_mask.squeeze(1).sum(1) #[B, 1, T] -> [B]
encoder_out_lens = encoder_mask.squeeze(1).cast(paddle.int64).sum(
1) #[B, 1, T] -> [B]
# 2a. Attention-decoder branch
loss_att = None
if self.ctc_weight != 1.0:
start = time.time()
loss_att, acc_att = self._calc_att_loss(encoder_out, encoder_mask,
text, text_lengths)
decoder_time = time.time() - start
#logger.debug(f"decoder time: {decoder_time}")
# 2b. CTC branch
loss_ctc = None
if self.ctc_weight != 0.0:
start = time.time()
loss_ctc = self.ctc(encoder_out, encoder_out_lens, text,
text_lengths)
ctc_time = time.time() - start
#logger.debug(f"ctc time: {ctc_time}")
if loss_ctc is None:
loss = loss_att
elif loss_att is None:
loss = loss_ctc
else:
loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
return loss, loss_att, loss_ctc
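        # Worked example of the interpolation above: with ctc_weight=0.3,
        # loss_ctc=120.0 and loss_att=80.0 (illustrative values),
        # loss = 0.3 * 120.0 + 0.7 * 80.0 = 92.0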
def _calc_att_loss(
self,
encoder_out: paddle.Tensor,
encoder_mask: paddle.Tensor,
ys_pad: paddle.Tensor,
ys_pad_lens: paddle.Tensor, ) -> Tuple[paddle.Tensor, float]:
"""Calc attention loss.
Args:
encoder_out (paddle.Tensor): [B, Tmax, D]
encoder_mask (paddle.Tensor): [B, 1, Tmax]
ys_pad (paddle.Tensor): [B, Umax]
ys_pad_lens (paddle.Tensor): [B]
Returns:
Tuple[paddle.Tensor, float]: attention_loss, accuracy rate
"""
ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos,
self.ignore_id)
ys_in_lens = ys_pad_lens + 1
# 1. Forward decoder
decoder_out, _ = self.decoder(encoder_out, encoder_mask, ys_in_pad,
ys_in_lens)
# 2. Compute attention loss
loss_att = self.criterion_att(decoder_out, ys_out_pad)
acc_att = th_accuracy(
decoder_out.view(-1, self.vocab_size),
ys_out_pad,
ignore_label=self.ignore_id, )
return loss_att, acc_att
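        # Example of the sos/eos shift above (sos = eos = vocab_size - 1):
        #     ys_pad     = [w1, w2, w3]
        #     ys_in_pad  = [sos, w1, w2, w3]   # decoder input
        #     ys_out_pad = [w1, w2, w3, eos]   # loss target
        # hence ys_in_lens = ys_pad_lens + 1.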
def _forward_encoder(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False,
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Encoder pass.
Args:
speech (paddle.Tensor): [B, Tmax, D]
speech_lengths (paddle.Tensor): [B]
            decoding_chunk_size (int, optional): chunk size. Defaults to -1.
            num_decoding_left_chunks (int, optional): number of left chunks. Defaults to -1.
            simulate_streaming (bool, optional): streaming or not. Defaults to False.
Returns:
Tuple[paddle.Tensor, paddle.Tensor]:
encoder hiddens (B, Tmax, D),
encoder hiddens mask (B, 1, Tmax).
"""
# Let's assume B = batch_size
# 1. Encoder
if simulate_streaming and decoding_chunk_size > 0:
encoder_out, encoder_mask = self.encoder.forward_chunk_by_chunk(
speech,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks
) # (B, maxlen, encoder_dim)
else:
encoder_out, encoder_mask = self.encoder(
speech,
speech_lengths,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks
) # (B, maxlen, encoder_dim)
return encoder_out, encoder_mask
def recognize(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int=10,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False, ) -> paddle.Tensor:
""" Apply beam search on attention decoder
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
            speech_lengths (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
paddle.Tensor: decoding result, (batch, max_result_len)
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
device = speech.place
batch_size = speech.shape[0]
# Let's assume B = batch_size and N = beam_size
# 1. Encoder
encoder_out, encoder_mask = self._forward_encoder(
speech, speech_lengths, decoding_chunk_size,
num_decoding_left_chunks,
simulate_streaming) # (B, maxlen, encoder_dim)
maxlen = encoder_out.size(1)
encoder_dim = encoder_out.size(2)
running_size = batch_size * beam_size
encoder_out = encoder_out.unsqueeze(1).repeat(1, beam_size, 1, 1).view(
running_size, maxlen, encoder_dim) # (B*N, maxlen, encoder_dim)
encoder_mask = encoder_mask.unsqueeze(1).repeat(
1, beam_size, 1, 1).view(running_size, 1,
maxlen) # (B*N, 1, max_len)
hyps = paddle.ones(
[running_size, 1], dtype=paddle.long).fill_(self.sos) # (B*N, 1)
# log scale score
scores = paddle.to_tensor(
[0.0] + [-float('inf')] * (beam_size - 1), dtype=paddle.float)
scores = scores.to(device).repeat(batch_size).unsqueeze(1).to(
device) # (B*N, 1)
end_flag = paddle.zeros_like(scores, dtype=paddle.bool) # (B*N, 1)
cache: Optional[List[paddle.Tensor]] = None
# 2. Decoder forward step by step
for i in range(1, maxlen + 1):
# Stop if all batch and all beam produce eos
# TODO(Hui Zhang): if end_flag.sum() == running_size:
if end_flag.cast(paddle.int64).sum() == running_size:
break
# 2.1 Forward decoder step
hyps_mask = subsequent_mask(i).unsqueeze(0).repeat(
running_size, 1, 1).to(device) # (B*N, i, i)
# logp: (B*N, vocab)
logp, cache = self.decoder.forward_one_step(
encoder_out, encoder_mask, hyps, hyps_mask, cache)
# 2.2 First beam prune: select topk best prob at current time
top_k_logp, top_k_index = logp.topk(beam_size) # (B*N, N)
top_k_logp = mask_finished_scores(top_k_logp, end_flag)
top_k_index = mask_finished_preds(top_k_index, end_flag, self.eos)
            # 2.3 Second beam prune: select topk score with history
scores = scores + top_k_logp # (B*N, N), broadcast add
scores = scores.view(batch_size, beam_size * beam_size) # (B, N*N)
scores, offset_k_index = scores.topk(k=beam_size) # (B, N)
scores = scores.view(-1, 1) # (B*N, 1)
            # 2.4. Compute base index in top_k_index,
            # regard top_k_index as (B*N*N), regard offset_k_index as (B*N),
            # then find offset_k_index in top_k_index
base_k_index = paddle.arange(batch_size).view(-1, 1).repeat(
1, beam_size) # (B, N)
base_k_index = base_k_index * beam_size * beam_size
best_k_index = base_k_index.view(-1) + offset_k_index.view(
-1) # (B*N)
# 2.5 Update best hyps
best_k_pred = paddle.index_select(
top_k_index.view(-1), index=best_k_index, axis=0) # (B*N)
best_hyps_index = best_k_index // beam_size
last_best_k_hyps = paddle.index_select(
hyps, index=best_hyps_index, axis=0) # (B*N, i)
hyps = paddle.cat(
(last_best_k_hyps, best_k_pred.view(-1, 1)),
dim=1) # (B*N, i+1)
# 2.6 Update end flag
end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)
# TODO: length normalization
best_index = paddle.argmax(scores, axis=-1).long() # (B)
best_hyps_index = best_index + paddle.arange(
batch_size, dtype=paddle.long) * beam_size
best_hyps = paddle.index_select(hyps, index=best_hyps_index, axis=0)
best_hyps = best_hyps[:, 1:]
return best_hyps
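        # Index bookkeeping example for step 2.4 (B=2, N=2, illustrative):
        # scores viewed as (B, N*N) gives offset_k_index in [0, N*N); the
        # flat position in top_k_index (length B*N*N) is b * N * N + offset,
        # i.e. base_k_index = [[0, 0], [4, 4]] and best_k_index = base + offset.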
def ctc_greedy_search(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False, ) -> List[List[int]]:
""" Apply CTC greedy search
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
            speech_lengths (paddle.Tensor): (batch, )
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[List[int]]: best path result
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
batch_size = speech.shape[0]
# Let's assume B = batch_size
# encoder_out: (B, maxlen, encoder_dim)
# encoder_mask: (B, 1, Tmax)
encoder_out, encoder_mask = self._forward_encoder(
speech, speech_lengths, decoding_chunk_size,
num_decoding_left_chunks, simulate_streaming)
maxlen = encoder_out.size(1)
# (TODO Hui Zhang): bool no support reduce_sum
# encoder_out_lens = encoder_mask.squeeze(1).sum(1)
encoder_out_lens = encoder_mask.squeeze(1).astype(paddle.int).sum(1)
ctc_probs = self.ctc.log_softmax(encoder_out) # (B, maxlen, vocab_size)
topk_prob, topk_index = ctc_probs.topk(1, axis=2) # (B, maxlen, 1)
topk_index = topk_index.view(batch_size, maxlen) # (B, maxlen)
pad_mask = make_pad_mask(encoder_out_lens) # (B, maxlen)
topk_index = topk_index.masked_fill_(pad_mask, self.eos) # (B, maxlen)
hyps = [hyp.tolist() for hyp in topk_index]
hyps = [remove_duplicates_and_blank(hyp) for hyp in hyps]
return hyps
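        # Toy example of the pipeline above (blank id 0, illustrative):
        #     per-frame argmax            -> [0, 3, 3, 0, 5, 5, 0]
        #     remove_duplicates_and_blank -> [3, 5]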
def _ctc_prefix_beam_search(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False,
blank_id: int=0, ) -> Tuple[List[Tuple[int, float]], paddle.Tensor]:
""" CTC prefix beam search inner implementation
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
            speech_lengths (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[Tuple[int, float]]: nbest results, (N,1), (text, likelihood)
paddle.Tensor: encoder output, (1, max_len, encoder_dim),
it will be used for rescoring in attention rescoring mode
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
batch_size = speech.shape[0]
# For CTC prefix beam search, we only support batch_size=1
assert batch_size == 1
# Let's assume B = batch_size and N = beam_size
# 1. Encoder forward and get CTC score
encoder_out, encoder_mask = self._forward_encoder(
speech, speech_lengths, decoding_chunk_size,
num_decoding_left_chunks,
simulate_streaming) # (B, maxlen, encoder_dim)
maxlen = encoder_out.size(1)
ctc_probs = self.ctc.log_softmax(encoder_out) # (1, maxlen, vocab_size)
ctc_probs = ctc_probs.squeeze(0)
# cur_hyps: (prefix, (blank_ending_score, none_blank_ending_score))
cur_hyps = [(tuple(), (0.0, -float('inf')))]
# 2. CTC beam search step by step
for t in range(0, maxlen):
logp = ctc_probs[t] # (vocab_size,)
# key: prefix, value (pb, pnb), default value(-inf, -inf)
next_hyps = defaultdict(lambda: (-float('inf'), -float('inf')))
# 2.1 First beam prune: select topk best
top_k_logp, top_k_index = logp.topk(beam_size) # (beam_size,)
for s in top_k_index:
s = s.item()
ps = logp[s].item()
for prefix, (pb, pnb) in cur_hyps:
last = prefix[-1] if len(prefix) > 0 else None
if s == blank_id: # blank
n_pb, n_pnb = next_hyps[prefix]
n_pb = log_add([n_pb, pb + ps, pnb + ps])
next_hyps[prefix] = (n_pb, n_pnb)
elif s == last:
# Update *ss -> *s;
n_pb, n_pnb = next_hyps[prefix]
n_pnb = log_add([n_pnb, pnb + ps])
next_hyps[prefix] = (n_pb, n_pnb)
# Update *s-s -> *ss, - is for blank
n_prefix = prefix + (s, )
n_pb, n_pnb = next_hyps[n_prefix]
n_pnb = log_add([n_pnb, pb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb)
else:
n_prefix = prefix + (s, )
n_pb, n_pnb = next_hyps[n_prefix]
n_pnb = log_add([n_pnb, pb + ps, pnb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb)
# 2.2 Second beam prune
next_hyps = sorted(
next_hyps.items(),
key=lambda x: log_add(list(x[1])),
reverse=True)
cur_hyps = next_hyps[:beam_size]
hyps = [(y[0], log_add([y[1][0], y[1][1]])) for y in cur_hyps]
return hyps, encoder_out
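        # Numeric sketch of one merge above (illustrative): combining two
        # paths with log-probs -1.2 and -2.3 uses log_add, i.e.
        # log(exp(-1.2) + exp(-2.3)) ~= -0.91, so a prefix accumulates
        # probability over all alignments that collapse to it.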
def ctc_prefix_beam_search(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False, ) -> List[int]:
""" Apply CTC prefix beam search
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
            speech_lengths (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
            List[int]: CTC prefix beam search best path result
"""
hyps, _ = self._ctc_prefix_beam_search(
speech, speech_lengths, beam_size, decoding_chunk_size,
num_decoding_left_chunks, simulate_streaming)
return hyps[0][0]
def attention_rescoring(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
ctc_weight: float=0.0,
simulate_streaming: bool=False, ) -> List[int]:
""" Apply attention rescoring decoding, CTC prefix beam search
is applied first to get nbest, then we resoring the nbest on
attention decoder with corresponding encoder out
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
            speech_lengths (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[int]: Attention rescoring result
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
device = speech.place
batch_size = speech.shape[0]
# For attention rescoring we only support batch_size=1
assert batch_size == 1
# encoder_out: (1, maxlen, encoder_dim), len(hyps) = beam_size
hyps, encoder_out = self._ctc_prefix_beam_search(
speech, speech_lengths, beam_size, decoding_chunk_size,
num_decoding_left_chunks, simulate_streaming)
assert len(hyps) == beam_size
hyps_pad = pad_sequence([
paddle.to_tensor(hyp[0], place=device, dtype=paddle.long)
for hyp in hyps
], True, self.ignore_id) # (beam_size, max_hyps_len)
hyps_lens = paddle.to_tensor(
[len(hyp[0]) for hyp in hyps], place=device,
dtype=paddle.long) # (beam_size,)
hyps_pad, _ = add_sos_eos(hyps_pad, self.sos, self.eos, self.ignore_id)
        hyps_lens = hyps_lens + 1  # Add <sos> at beginning
encoder_out = encoder_out.repeat(beam_size, 1, 1)
encoder_mask = paddle.ones(
(beam_size, 1, encoder_out.size(1)), dtype=paddle.bool)
decoder_out, _ = self.decoder(
encoder_out, encoder_mask, hyps_pad,
hyps_lens) # (beam_size, max_hyps_len, vocab_size)
decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1)
decoder_out = decoder_out.numpy()
# Only use decoder score for rescoring
best_score = -float('inf')
best_index = 0
for i, hyp in enumerate(hyps):
score = 0.0
for j, w in enumerate(hyp[0]):
score += decoder_out[i][j][w]
score += decoder_out[i][len(hyp[0])][self.eos]
# add ctc score
score += hyp[1] * ctc_weight
if score > best_score:
best_score = score
best_index = i
return hyps[best_index][0]
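        # Rescoring rule used above, per CTC nbest hypothesis i:
        #     score_i = sum_j log P_att(w_j | w_<j, x) + log P_att(eos | w, x)
        #               + ctc_weight * score_ctc_i
        # and the hypothesis with the largest score_i is returned.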
@jit.export
def subsampling_rate(self) -> int:
""" Export interface for c++ call, return subsampling_rate of the
model
"""
return self.encoder.embed.subsampling_rate
@jit.export
def right_context(self) -> int:
""" Export interface for c++ call, return right_context of the model
"""
return self.encoder.embed.right_context
@jit.export
def sos_symbol(self) -> int:
""" Export interface for c++ call, return sos symbol id of the model
"""
return self.sos
@jit.export
def eos_symbol(self) -> int:
""" Export interface for c++ call, return eos symbol id of the model
"""
return self.eos
@jit.export
def forward_encoder_chunk(
self,
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None,
elayers_output_cache: Optional[List[paddle.Tensor]]=None,
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
""" Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk.
Args:
xs (paddle.Tensor): chunk input
subsampling_cache (Optional[paddle.Tensor]): subsampling cache
elayers_output_cache (Optional[List[paddle.Tensor]]):
transformer/conformer encoder layers output cache
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
cnn cache
Returns:
paddle.Tensor: output, it ranges from time 0 to current chunk.
paddle.Tensor: subsampling cache
List[paddle.Tensor]: attention cache
List[paddle.Tensor]: conformer cnn cache
"""
return self.encoder.forward_chunk(
xs, offset, required_cache_size, subsampling_cache,
elayers_output_cache, conformer_cnn_cache)
@jit.export
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:
""" Export interface for c++ call, apply linear transform and log
softmax before ctc
Args:
xs (paddle.Tensor): encoder output
Returns:
paddle.Tensor: activation before ctc
"""
return self.ctc.log_softmax(xs)
@jit.export
def forward_attention_decoder(
self,
hyps: paddle.Tensor,
hyps_lens: paddle.Tensor,
encoder_out: paddle.Tensor, ) -> paddle.Tensor:
""" Export interface for c++ call, forward decoder with multiple
hypothesis from ctc prefix beam search and one encoder output
Args:
            hyps (paddle.Tensor): hyps from ctc prefix beam search, already
                padded with sos at the beginning, (B, T)
hyps_lens (paddle.Tensor): length of each hyp in hyps, (B)
encoder_out (paddle.Tensor): corresponding encoder output, (B=1, T, D)
Returns:
            paddle.Tensor: decoder output, (B, L, vocab_size)
"""
assert encoder_out.size(0) == 1
num_hyps = hyps.size(0)
assert hyps_lens.size(0) == num_hyps
encoder_out = encoder_out.repeat(num_hyps, 1, 1)
# (B, 1, T)
encoder_mask = paddle.ones(
[num_hyps, 1, encoder_out.size(1)], dtype=paddle.bool)
# (num_hyps, max_hyps_len, vocab_size)
decoder_out, _ = self.decoder(encoder_out, encoder_mask, hyps,
hyps_lens)
        decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1)
return decoder_out
@paddle.no_grad()
def decode(self,
feats: paddle.Tensor,
feats_lengths: paddle.Tensor,
text_feature: Dict[str, int],
decoding_method: str,
lang_model_path: str,
beam_alpha: float,
beam_beta: float,
beam_size: int,
cutoff_prob: float,
cutoff_top_n: int,
num_processes: int,
ctc_weight: float=0.0,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False):
"""u2 decoding.
Args:
            feats (Tensor): audio features, (B, T, D)
            feats_lengths (Tensor): (B)
text_feature (TextFeaturizer): text feature object.
decoding_method (str): decoding mode, e.g.
'attention', 'ctc_greedy_search',
'ctc_prefix_beam_search', 'attention_rescoring'
lang_model_path (str): lm path.
beam_alpha (float): lm weight.
beam_beta (float): length penalty.
beam_size (int): beam size for search
cutoff_prob (float): for prune.
cutoff_top_n (int): for prune.
            num_processes (int): num of processes.
ctc_weight (float, optional): ctc weight for attention rescoring decode mode. Defaults to 0.0.
decoding_chunk_size (int, optional): decoding chunk size. Defaults to -1.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here.
num_decoding_left_chunks (int, optional):
number of left chunks for decoding. Defaults to -1.
simulate_streaming (bool, optional): simulate streaming inference. Defaults to False.
        Raises:
            ValueError: when decoding_method is not supported.
Returns:
List[List[int]]: transcripts.
"""
batch_size = feats.size(0)
if decoding_method in ['ctc_prefix_beam_search',
'attention_rescoring'] and batch_size > 1:
logger.fatal(
f'decoding mode {decoding_method} must be running with batch_size == 1'
)
sys.exit(1)
if decoding_method == 'attention':
hyps = self.recognize(
feats,
feats_lengths,
beam_size=beam_size,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
hyps = [hyp.tolist() for hyp in hyps]
elif decoding_method == 'ctc_greedy_search':
hyps = self.ctc_greedy_search(
feats,
feats_lengths,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
        # ctc_prefix_beam_search and attention_rescoring only return one
        # result in List[int], change it to List[List[int]] to be compatible
        # with the other batch decoding modes
elif decoding_method == 'ctc_prefix_beam_search':
assert feats.size(0) == 1
hyp = self.ctc_prefix_beam_search(
feats,
feats_lengths,
beam_size,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
hyps = [hyp]
elif decoding_method == 'attention_rescoring':
assert feats.size(0) == 1
hyp = self.attention_rescoring(
feats,
feats_lengths,
beam_size,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
ctc_weight=ctc_weight,
simulate_streaming=simulate_streaming)
hyps = [hyp]
else:
raise ValueError(f"Not support decoding method: {decoding_method}")
res = [text_feature.defeaturize(hyp) for hyp in hyps]
return res
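        # Call sketch (argument values illustrative; text_feature is the
        # project's TextFeaturizer):
        #     texts = model.decode(feats, feats_lengths, text_feature,
        #                          decoding_method='attention_rescoring',
        #                          lang_model_path='', beam_alpha=0.0,
        #                          beam_beta=0.0, beam_size=10,
        #                          cutoff_prob=1.0, cutoff_top_n=40,
        #                          num_processes=1, ctc_weight=0.5)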
class U2Model(U2BaseModel):
def __init__(self, configs: dict):
vocab_size, encoder, decoder, ctc = U2Model._init_from_config(configs)
super().__init__(
vocab_size=vocab_size,
encoder=encoder,
decoder=decoder,
ctc=ctc,
**configs['model_conf'])
@classmethod
def _init_from_config(cls, configs: dict):
"""init sub module for model.
Args:
configs (dict): config dict.
        Raises:
            ValueError: raised when the encoder type is not supported.
Returns:
int, nn.Layer, nn.Layer, nn.Layer: vocab size, encoder, decoder, ctc
"""
if configs['cmvn_file'] is not None:
mean, istd = load_cmvn(configs['cmvn_file'],
configs['cmvn_file_type'])
global_cmvn = GlobalCMVN(
paddle.to_tensor(mean, dtype=paddle.float),
paddle.to_tensor(istd, dtype=paddle.float))
else:
global_cmvn = None
input_dim = configs['input_dim']
vocab_size = configs['output_dim']
assert input_dim != 0, input_dim
assert vocab_size != 0, vocab_size
encoder_type = configs.get('encoder', 'transformer')
logger.info(f"U2 Encoder type: {encoder_type}")
if encoder_type == 'transformer':
encoder = TransformerEncoder(
input_dim, global_cmvn=global_cmvn, **configs['encoder_conf'])
elif encoder_type == 'conformer':
encoder = ConformerEncoder(
input_dim, global_cmvn=global_cmvn, **configs['encoder_conf'])
else:
raise ValueError(f"not support encoder type:{encoder_type}")
decoder = TransformerDecoder(vocab_size,
encoder.output_size(),
**configs['decoder_conf'])
ctc = CTCDecoder(
odim=vocab_size,
enc_n_units=encoder.output_size(),
blank_id=0,
dropout_rate=0.0,
reduction=True, # sum
batch_average=True) # sum / batch_size
return vocab_size, encoder, decoder, ctc
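        # A hypothetical minimal config accepted here (keys match the
        # accesses above; values illustrative):
        #     configs = dict(
        #         cmvn_file='data/mean_std.json', cmvn_file_type='json',
        #         input_dim=80, output_dim=4233,
        #         encoder='conformer', encoder_conf=dict(output_size=256),
        #         decoder='transformer', decoder_conf=dict(num_blocks=6),
        #         model_conf=dict(ctc_weight=0.3, lsm_weight=0.1,
        #                         length_normalized_loss=False))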
@classmethod
def from_config(cls, configs: dict):
"""init model.
Args:
configs (dict): config dict.
        Raises:
            ValueError: raised when the encoder type is not supported.
Returns:
nn.Layer: U2Model
"""
model = cls(configs)
return model
@classmethod
def from_pretrained(cls, dataset, config, checkpoint_path):
"""Build a DeepSpeech2Model model from a pretrained model.
Args:
dataset (paddle.io.Dataset): not used.
config (yacs.config.CfgNode): model configs
checkpoint_path (Path or str): the path of pretrained model checkpoint, without extension name
Returns:
            U2Model: The model built from pretrained result.
"""
config.defrost()
config.input_dim = dataset.feature_size
config.output_dim = dataset.vocab_size
config.freeze()
model = cls.from_config(config)
if checkpoint_path:
infos = checkpoint.load_parameters(
model, checkpoint_path=checkpoint_path)
logger.info(f"checkpoint info: {infos}")
layer_tools.summary(model)
return model
class U2InferModel(U2Model):
def __init__(self, configs: dict):
super().__init__(configs)
def forward(self,
feats,
feats_lengths,
decoding_chunk_size=-1,
num_decoding_left_chunks=-1,
simulate_streaming=False):
"""export model function
Args:
feats (Tensor): [B, T, D]
feats_lengths (Tensor): [B]
Returns:
List[List[int]]: best path result
"""
return self.ctc_greedy_search(
feats,
feats_lengths,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
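        # Hedged export sketch (the paddle.jit usage is illustrative, an
        # assumption rather than part of this file); greedy search keeps the
        # exported graph free of beam-search control flow:
        #     infer = U2InferModel(configs)
        #     infer.eval()
        #     static = paddle.jit.to_static(infer)  # assumes static-friendly ops
        #     paddle.jit.save(static, 'u2_infer')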

@ -11,19 +11,16 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from collections import OrderedDict

import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I

from deepspeech.utils.log import Log

logger = Log(__name__).getlog()

__all__ = ["get_activation", "brelu", "LinearGLUBlock", "ConvGLUBlock"]
def brelu(x, t_min=0.0, t_max=24.0, name=None):
@ -33,36 +30,116 @@ def brelu(x, t_min=0.0, t_max=24.0, name=None):
return x.maximum(t_min).minimum(t_max)
class LinearGLUBlock(nn.Layer):
    """A linear Gated Linear Units (GLU) block."""

    def __init__(self, idim: int):
        """ GLU.
        Args:
            idim (int): input and output dimension
        """
        super().__init__()
        self.fc = nn.Linear(idim, idim * 2)

    def forward(self, xs):
        # F.glu halves the last dimension: first half * sigmoid(second half)
        return F.glu(self.fc(xs), axis=-1)


class GLU(nn.Layer):
    """Gated Linear Units (GLU) layer; wraps F.glu since paddle 2.1 has no
    nn.GLU layer. Defaults to the channel axis, as ConvGLUBlock needs."""

    def __init__(self, axis: int=1):
        super().__init__()
        self.axis = axis

    def forward(self, xs):
        return F.glu(xs, axis=self.axis)


class ConvGLUBlock(nn.Layer):
    def __init__(self, kernel_size, in_ch, out_ch, bottleneck_dim=0,
                 dropout=0.):
        """A convolutional Gated Linear Units (GLU) block.
        Args:
            kernel_size (int): kernel size
            in_ch (int): number of input channels
            out_ch (int): number of output channels
            bottleneck_dim (int): dimension of the bottleneck layers for computational efficiency. Defaults to 0.
            dropout (float): dropout probability. Defaults to 0.0.
        """
        super().__init__()

        self.conv_residual = None
        if in_ch != out_ch:
            self.conv_residual = nn.utils.weight_norm(
                nn.Conv2D(
                    in_channels=in_ch, out_channels=out_ch, kernel_size=(1, 1)),
                name='weight',
                dim=0)
            self.dropout_residual = nn.Dropout(p=dropout)

        # left-pad the time axis so the conv is causal; nn.Pad2D is paddle's
        # equivalent of ConstantPad2d for NCHW input
        self.pad_left = nn.Pad2D(
            [0, 0, kernel_size - 1, 0], mode='constant', value=0.)

        layers = OrderedDict()
        if bottleneck_dim == 0:
            layers['conv'] = nn.utils.weight_norm(
                nn.Conv2D(
                    in_channels=in_ch,
                    out_channels=out_ch * 2,
                    kernel_size=(kernel_size, 1)),
                name='weight',
                dim=0)
            # TODO(hirofumi0810): padding?
            layers['dropout'] = nn.Dropout(p=dropout)
            layers['glu'] = GLU()
        elif bottleneck_dim > 0:
            layers['conv_in'] = nn.utils.weight_norm(
                nn.Conv2D(
                    in_channels=in_ch,
                    out_channels=bottleneck_dim,
                    kernel_size=(1, 1)),
                name='weight',
                dim=0)
            layers['dropout_in'] = nn.Dropout(p=dropout)
            layers['conv_bottleneck'] = nn.utils.weight_norm(
                nn.Conv2D(
                    in_channels=bottleneck_dim,
                    out_channels=bottleneck_dim,
                    kernel_size=(kernel_size, 1)),
                name='weight',
                dim=0)
            layers['dropout'] = nn.Dropout(p=dropout)
            layers['glu'] = GLU()
            layers['conv_out'] = nn.utils.weight_norm(
                nn.Conv2D(
                    in_channels=bottleneck_dim,
                    out_channels=out_ch * 2,
                    kernel_size=(1, 1)),
                name='weight',
                dim=0)
            layers['dropout_out'] = nn.Dropout(p=dropout)

        # nn.Sequential takes (name, layer) tuples, not an OrderedDict
        self.layers = nn.Sequential(*layers.items())

    def forward(self, xs):
        """Forward pass.
        Args:
            xs (FloatTensor): `[B, in_ch, T, feat_dim]`
        Returns:
            out (FloatTensor): `[B, out_ch, T, feat_dim]`
        """
        residual = xs
        if self.conv_residual is not None:
            residual = self.dropout_residual(self.conv_residual(residual))
        xs = self.pad_left(xs)  # `[B, embed_dim, T+kernel-1, 1]`
        xs = self.layers(xs)  # `[B, out_ch * 2, T, 1]`
        xs = xs + residual
        return xs


def get_activation(act):
    """Return activation function."""
    # Lazy load to avoid unused import
    activation_funcs = {
        "hardtanh": paddle.nn.Hardtanh,
        "tanh": paddle.nn.Tanh,
        "relu": paddle.nn.ReLU,
        "selu": paddle.nn.SELU,
        "swish": paddle.nn.Swish,
        "gelu": paddle.nn.GELU,
        # brelu is a plain function; Hardtanh on [0, 24] is its layer form
        "brelu": lambda: paddle.nn.Hardtanh(0.0, 24.0),
    }
    return activation_funcs[act]()
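# Usage sketch (illustrative values):
#     act = get_activation('swish')             # paddle.nn.Swish layer
#     act(paddle.to_tensor([-1., 0., 1.]))      # ~[-0.269, 0., 0.731]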

@ -0,0 +1,233 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Multi-Head Attention layer definition."""
import math
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from paddle.nn import initializer as I
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["MultiHeadedAttention", "RelPositionMultiHeadedAttention"]
# Relative Positional Encodings
# https://www.jianshu.com/p/c0608efcc26f
# https://zhuanlan.zhihu.com/p/344604604
class MultiHeadedAttention(nn.Layer):
"""Multi-Head Attention layer."""
def __init__(self, n_head: int, n_feat: int, dropout_rate: float):
"""Construct an MultiHeadedAttention object.
Args:
n_head (int): The number of heads.
n_feat (int): The number of features.
dropout_rate (float): Dropout rate.
"""
super().__init__()
assert n_feat % n_head == 0
# We assume d_v always equals d_k
self.d_k = n_feat // n_head
self.h = n_head
self.linear_q = nn.Linear(n_feat, n_feat)
self.linear_k = nn.Linear(n_feat, n_feat)
self.linear_v = nn.Linear(n_feat, n_feat)
self.linear_out = nn.Linear(n_feat, n_feat)
self.dropout = nn.Dropout(p=dropout_rate)
def forward_qkv(self,
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Transform query, key and value.
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
Returns:
paddle.Tensor: Transformed query tensor, size
(#batch, n_head, time1, d_k).
paddle.Tensor: Transformed key tensor, size
(#batch, n_head, time2, d_k).
paddle.Tensor: Transformed value tensor, size
(#batch, n_head, time2, d_k).
"""
n_batch = query.size(0)
q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
q = q.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
k = k.transpose([0, 2, 1, 3]) # (batch, head, time2, d_k)
v = v.transpose([0, 2, 1, 3]) # (batch, head, time2, d_k)
return q, k, v
def forward_attention(self,
value: paddle.Tensor,
scores: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor:
"""Compute attention context vector.
Args:
value (paddle.Tensor): Transformed value, size
(#batch, n_head, time2, d_k).
scores (paddle.Tensor): Attention score, size
(#batch, n_head, time1, time2).
mask (paddle.Tensor): Mask, size (#batch, 1, time2) or
(#batch, time1, time2).
Returns:
paddle.Tensor: Transformed value weighted
by the attention score, (#batch, time1, d_model).
"""
n_batch = value.size(0)
if mask is not None:
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
scores = scores.masked_fill(mask, -float('inf'))
attn = paddle.softmax(
scores, axis=-1).masked_fill(mask,
0.0) # (batch, head, time1, time2)
else:
attn = paddle.softmax(
scores, axis=-1) # (batch, head, time1, time2)
p_attn = self.dropout(attn)
x = paddle.matmul(p_attn, value) # (batch, head, time1, d_k)
x = x.transpose([0, 2, 1, 3]).contiguous().view(
n_batch, -1, self.h * self.d_k) # (batch, time1, d_model)
return self.linear_out(x) # (batch, time1, d_model)
def forward(self,
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor:
"""Compute scaled dot product attention.
        Args:
            query (paddle.Tensor): Query tensor (#batch, time1, size).
            key (paddle.Tensor): Key tensor (#batch, time2, size).
            value (paddle.Tensor): Value tensor (#batch, time2, size).
            mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
                (#batch, time1, time2).
        Returns:
            paddle.Tensor: Output tensor (#batch, time1, d_model).
"""
q, k, v = self.forward_qkv(query, key, value)
scores = paddle.matmul(q,
k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k)
return self.forward_attention(v, scores, mask)
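        # Shape walk-through (illustrative: B=1, h=4, d_k=64, time1=time2=T):
        # q, k, v: (1, 4, T, 64); scores = q @ k^T / sqrt(64): (1, 4, T, T);
        # masked softmax then attn @ v: (1, 4, T, 64); heads re-concatenated
        # to (1, T, 256) and projected by linear_out back to (1, T, 256).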
class RelPositionMultiHeadedAttention(MultiHeadedAttention):
"""Multi-Head Attention layer with relative position encoding."""
def __init__(self, n_head, n_feat, dropout_rate):
"""Construct an RelPositionMultiHeadedAttention object.
Paper: https://arxiv.org/abs/1901.02860
Args:
n_head (int): The number of heads.
n_feat (int): The number of features.
dropout_rate (float): Dropout rate.
"""
super().__init__(n_head, n_feat, dropout_rate)
# linear transformation for positional encoding
self.linear_pos = nn.Linear(n_feat, n_feat, bias_attr=False)
# these two learnable bias are used in matrix c and matrix d
# as described in https://arxiv.org/abs/1901.02860 Section 3.3
#self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
#self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
#torch.nn.init.xavier_uniform_(self.pos_bias_u)
#torch.nn.init.xavier_uniform_(self.pos_bias_v)
pos_bias_u = self.create_parameter(
[self.h, self.d_k], default_initializer=I.XavierUniform())
self.add_parameter('pos_bias_u', pos_bias_u)
pos_bias_v = self.create_parameter(
(self.h, self.d_k), default_initializer=I.XavierUniform())
self.add_parameter('pos_bias_v', pos_bias_v)
def rel_shift(self, x, zero_triu: bool=False):
"""Compute relative positinal encoding.
Args:
x (paddle.Tensor): Input tensor (batch, head, time1, time1).
zero_triu (bool): If true, return the lower triangular part of
the matrix.
Returns:
paddle.Tensor: Output tensor. (batch, head, time1, time1)
"""
zero_pad = paddle.zeros(
(x.size(0), x.size(1), x.size(2), 1), dtype=x.dtype)
x_padded = paddle.cat([zero_pad, x], dim=-1)
x_padded = x_padded.view(x.size(0), x.size(1), x.size(3) + 1, x.size(2))
x = x_padded[:, :, 1:].view_as(x) # [B, H, T1, T1]
if zero_triu:
ones = paddle.ones((x.size(2), x.size(3)))
x = x * paddle.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
return x
def forward(self,
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
pos_emb: paddle.Tensor,
mask: Optional[paddle.Tensor]):
"""Compute 'Scaled Dot Product Attention' with rel. positional encoding.
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
pos_emb (paddle.Tensor): Positional embedding tensor
(#batch, time1, size).
mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
Returns:
paddle.Tensor: Output tensor (#batch, time1, d_model).
"""
q, k, v = self.forward_qkv(query, key, value)
q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k)
n_batch_pos = pos_emb.size(0)
p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
# (batch, head, time1, d_k)
q_with_bias_u = (q + self.pos_bias_u).transpose([0, 2, 1, 3])
# (batch, head, time1, d_k)
q_with_bias_v = (q + self.pos_bias_v).transpose([0, 2, 1, 3])
# compute attention score
# first compute matrix a and matrix c
# as described in https://arxiv.org/abs/1901.02860 Section 3.3
# (batch, head, time1, time2)
matrix_ac = paddle.matmul(q_with_bias_u, k.transpose([0, 1, 3, 2]))
# compute matrix b and matrix d
# (batch, head, time1, time2)
matrix_bd = paddle.matmul(q_with_bias_v, p.transpose([0, 1, 3, 2]))
# Remove rel_shift since it is useless in speech recognition,
# and it requires special attention for streaming.
# matrix_bd = self.rel_shift(matrix_bd)
scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k) # (batch, head, time1, time2)
return self.forward_attention(v, scores, mask)
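        # The two terms follow Transformer-XL (Section 3.3): for query i and
        # key j, score(i, j) = (q_i + u)^T k_j + (q_i + v)^T p_(i-j), where
        # u and v are pos_bias_u / pos_bias_v, all scaled by 1/sqrt(d_k).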

@ -0,0 +1,51 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['GlobalCMVN']
class GlobalCMVN(nn.Layer):
def __init__(self,
mean: paddle.Tensor,
istd: paddle.Tensor,
norm_var: bool=True):
"""
Args:
mean (paddle.Tensor): mean stats
istd (paddle.Tensor): inverse std, std which is 1.0 / std
"""
super().__init__()
assert mean.shape == istd.shape
self.norm_var = norm_var
# The buffer can be accessed from this module using self.mean
self.register_buffer("mean", mean)
self.register_buffer("istd", istd)
def forward(self, x: paddle.Tensor):
"""
Args:
x (paddle.Tensor): (batch, max_len, feat_dim)
Returns:
(paddle.Tensor): normalized feature
"""
x = x - self.mean
if self.norm_var:
x = x * self.istd
return x
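        # Sketch of producing the buffers offline (names illustrative): with
        # per-dim accumulators sum_x, sum_x2 and frame count N over training
        # data,
        #     mean = sum_x / N
        #     istd = 1.0 / sqrt(sum_x2 / N - mean ** 2 + eps)
        # so forward computes (x - mean) * istd per feature dimension.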

@ -0,0 +1,161 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ConvolutionModule definition."""
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from typeguard import check_argument_types
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['ConvolutionModule']
class ConvolutionModule(nn.Layer):
"""ConvolutionModule in Conformer model."""
def __init__(self,
channels: int,
kernel_size: int=15,
activation: nn.Layer=nn.ReLU(),
norm: str="batch_norm",
causal: bool=False,
bias: bool=True):
"""Construct an ConvolutionModule object.
Args:
channels (int): The number of channels of conv layers.
kernel_size (int): Kernel size of conv layers.
activation (nn.Layer): Activation Layer.
norm (str): Normalization type, 'batch_norm' or 'layer_norm'
causal (bool): Whether use causal convolution or not
bias (bool): Whether Conv with bias or not
"""
assert check_argument_types()
super().__init__()
self.pointwise_conv1 = nn.Conv1D(
channels,
2 * channels,
kernel_size=1,
stride=1,
padding=0,
bias_attr=None
if bias else False, # None for True, using bias as default config
)
# self.lorder is used to distinguish if it's a causal convolution,
# if self.lorder > 0:
# it's a causal convolution, the input will be padded with
# `self.lorder` frames on the left in forward (causal conv impl).
# else: it's a symmetrical convolution
if causal:
padding = 0
self.lorder = kernel_size - 1
else:
            # kernel_size should be an odd number for non-causal convolution
assert (kernel_size - 1) % 2 == 0
padding = (kernel_size - 1) // 2
self.lorder = 0
self.depthwise_conv = nn.Conv1D(
channels,
channels,
kernel_size,
stride=1,
padding=padding,
groups=channels,
bias_attr=None
if bias else False, # None for True, using bias as default config
)
assert norm in ['batch_norm', 'layer_norm']
if norm == "batch_norm":
self.use_layer_norm = False
self.norm = nn.BatchNorm1D(channels)
else:
self.use_layer_norm = True
self.norm = nn.LayerNorm(channels)
self.pointwise_conv2 = nn.Conv1D(
channels,
channels,
kernel_size=1,
stride=1,
padding=0,
bias_attr=None
if bias else False, # None for True, using bias as default config
)
self.activation = activation
def forward(self,
x: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
cache: Optional[paddle.Tensor]=None
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute convolution module.
Args:
x (paddle.Tensor): Input tensor (#batch, time, channels).
mask_pad (paddle.Tensor): used for batch padding, (#batch, channels, time).
cache (paddle.Tensor): left context cache, it is only
used in causal convolution. (#batch, channels, time')
Returns:
paddle.Tensor: Output tensor (#batch, time, channels).
paddle.Tensor: Output cache tensor (#batch, channels, time')
"""
# exchange the temporal dimension and the feature dimension
x = x.transpose([0, 2, 1]) # [B, C, T]
# mask batch padding
if mask_pad is not None:
x = x.masked_fill(mask_pad, 0.0)
if self.lorder > 0:
if cache is None:
x = nn.functional.pad(
x, (self.lorder, 0), 'constant', 0.0, data_format='NCL')
else:
assert cache.shape[0] == x.shape[0] # B
assert cache.shape[1] == x.shape[1] # C
x = paddle.concat((cache, x), axis=2)
assert (x.shape[2] > self.lorder)
new_cache = x[:, :, -self.lorder:] #[B, C, T]
else:
            # It's better we just return None if no cache is required.
            # However, for JIT export, here we just fake one tensor instead
            # of None.
new_cache = paddle.zeros([1], dtype=x.dtype)
# GLU mechanism
x = self.pointwise_conv1(x) # (batch, 2*channel, dim)
x = nn.functional.glu(x, axis=1) # (batch, channel, dim)
# 1D Depthwise Conv
x = self.depthwise_conv(x)
if self.use_layer_norm:
x = x.transpose([0, 2, 1]) # [B, T, C]
x = self.activation(self.norm(x))
if self.use_layer_norm:
x = x.transpose([0, 2, 1]) # [B, C, T]
x = self.pointwise_conv2(x)
# mask batch padding
if mask_pad is not None:
x = x.masked_fill(mask_pad, 0.0)
x = x.transpose([0, 2, 1]) # [B, T, C]
return x, new_cache
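        # Causal-cache walk-through (kernel_size=15 so lorder=14,
        # illustrative): chunk 1 is left-padded with 14 zero frames and its
        # last 14 input frames become new_cache; chunk 2 is processed as
        # concat(cache, chunk), so the depthwise conv always sees full left
        # context and no future frames.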

@ -11,20 +11,41 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I

from deepspeech.modules.activation import brelu
from deepspeech.modules.mask import sequence_mask
from deepspeech.utils.log import Log

logger = Log(__name__).getlog()
__all__ = ['ConvStack', "conv_output_size"]
def conv_output_size(I, F, P, S):
    # https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks#hyperparameters
    # By noting I the length of the input volume size,
    # F the length of the filter,
    # P the amount of zero padding,
    # S the stride,
    # the output size O of the feature map along that dimension is:
    #   O = (I - F + Pstart + Pend) // S + 1
    # When Pstart == Pend == P, Pstart + Pend can be replaced by 2P; with
    # integer floor division the trailing `+ 1` folds into the numerator:
    #   O = (I - F + 2P + S) // S
    # https://iq.opengenus.org/output-size-of-convolution/
    # Output height = (Input height + padding height top + padding height bottom - kernel height) / (stride height) + 1
    # Output width = (Input width + padding width right + padding width left - kernel width) / (stride width) + 1
    return (I - F + 2 * P + S) // S
# receptive field calculator
# https://fomoro.com/research/article/receptive-field-calculator
# https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks#hyperparameters
# https://distill.pub/2019/computing-receptive-fields/
# Rl-1 = Sl * Rl + (Kl - Sl)
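# Worked example for conv_output_size (illustrative; assumes the stacked
# ConvBn below uses kernel height 21 with padding 10 and stride 2, as its
# padding=(10, 5) suggests): along the frequency axis,
#     O = (I - 21 + 2 * 10 + 2) // 2 = (I + 1) // 2 = (I - 1) // 2 + 1,
# which matches the `output_height` computation in ConvStack.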
class ConvBn(nn.Layer):
@ -120,7 +141,7 @@ class ConvStack(nn.Layer):
act='brelu')
out_channel = 32
        convs = [
ConvBn(
num_channels_in=32,
num_channels_out=out_channel,
@ -128,7 +149,8 @@ class ConvStack(nn.Layer):
stride=(2, 1),
padding=(10, 5),
act='brelu') for i in range(num_stacks - 1)
        ]
        self.conv_stack = nn.LayerList(convs)
# conv output feat_dim
output_height = (feat_size - 1) // 2 + 1

@ -11,38 +11,36 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
from typeguard import check_argument_types

from deepspeech.decoders.swig_wrapper import ctc_beam_search_decoder_batch
from deepspeech.decoders.swig_wrapper import ctc_greedy_decoder
from deepspeech.decoders.swig_wrapper import Scorer
from deepspeech.modules.loss import CTCLoss
from deepspeech.utils import ctc_utils
from deepspeech.utils.log import Log

logger = Log(__name__).getlog()
__all__ = ['CTCDecoder']
class CTCDecoder(nn.Layer):
    def __init__(self,
                 odim,
                 enc_n_units,
                 blank_id=0,
                 dropout_rate: float=0.0,
                 reduction: bool=True,
                 batch_average: bool=True):
        """CTC decoder
        Args:
            odim ([int]): text vocabulary size
            enc_n_units ([int]): encoder output dimension
            dropout_rate (float): dropout rate (0.0 ~ 1.0)
            reduction (bool): reduce the CTC loss into a scalar, True for 'sum' or 'none'
            batch_average (bool): do batch dim wise average.
@ -72,38 +70,31 @@ class CTCDecoder(nn.Layer):
            ys_pad (Tensor): batch of padded character id sequence tensor (B, Lmax)
            ys_lens (Tensor): batch of lengths of character sequence (B)
        Returns:
            loss (Tensor): ctc loss value, scalar.
"""
logits = self.ctc_lo(F.dropout(hs_pad, p=self.dropout_rate))
loss = self.criterion(logits, ys_pad, hlens, ys_lens)
return loss
    def softmax(self, eouts: paddle.Tensor, temperature: float=1.0):
        """Get CTC probabilities.
        Args:
            eouts (FloatTensor): `[B, T, enc_units]`
        Returns:
            probs (FloatTensor): `[B, T, odim]`
        """
        self.probs = F.softmax(self.ctc_lo(eouts) / temperature, axis=2)
        return self.probs
    def log_softmax(self, hs_pad: paddle.Tensor,
                    temperature: float=1.0) -> paddle.Tensor:
        """log_softmax of frame activations
        Args:
            Tensor hs_pad: 3d tensor (B, Tmax, eprojs)
        Returns:
            paddle.Tensor: log softmax applied 3d tensor (B, Tmax, odim)
        """
        return F.log_softmax(self.ctc_lo(hs_pad) / temperature, axis=2)
def argmax(self, hs_pad: paddle.Tensor) -> paddle.Tensor:
"""argmax of frame activations
@ -114,6 +105,20 @@ class CTCDecoder(nn.Layer):
"""
        return paddle.argmax(self.ctc_lo(hs_pad), axis=2)
def forced_align(self,
ctc_probs: paddle.Tensor,
y: paddle.Tensor,
blank_id=0) -> list:
"""ctc forced alignment.
Args:
ctc_probs (paddle.Tensor): hidden state sequence, 2d tensor (T, D)
y (paddle.Tensor): label id sequence tensor, 1d tensor (L)
blank_id (int): blank symbol index
Returns:
paddle.Tensor: best alignment result, (T).
"""
return ctc_utils.forced_align(ctc_probs, y, blank_id)
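        # Toy example (blank_id=0, illustrative): for y = [3, 5] and T = 5
        # frames, the returned alignment is a length-5 path such as
        # [3, 3, 0, 5, 5] that collapses to y under CTC rules.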
def _decode_batch_greedy(self, probs_split, vocab_list):
"""Decode by best path for a batch of probs matrix input.
:param probs_split: List of 2-D probability matrix, and each consists
@ -147,7 +152,7 @@ class CTCDecoder(nn.Layer):
:type vocab_list: list
"""
# init once
        if self._ext_scorer is not None:
return
if language_model_path != '':
@ -195,7 +200,7 @@ class CTCDecoder(nn.Layer):
:return: List of transcription texts.
:rtype: List of str
"""
        if self._ext_scorer is not None:
self._ext_scorer.reset_params(beam_alpha, beam_beta)
# beam search decode
@ -221,9 +226,28 @@ class CTCDecoder(nn.Layer):
def decode_probs(self, probs, logits_lens, vocab_list, decoding_method,
lang_model_path, beam_alpha, beam_beta, beam_size,
cutoff_prob, cutoff_top_n, num_processes):
""" probs: activation after softmax
logits_len: audio output lens
"""ctc decoding with probs.
Args:
probs (Tenosr): activation after softmax
logits_lens (Tenosr): audio output lens
vocab_list ([type]): [description]
decoding_method ([type]): [description]
lang_model_path ([type]): [description]
beam_alpha ([type]): [description]
beam_beta ([type]): [description]
beam_size ([type]): [description]
cutoff_prob ([type]): [description]
cutoff_top_n ([type]): [description]
num_processes ([type]): [description]
Raises:
ValueError: when decoding_method not support.
Returns:
List[str]: transcripts.
"""
probs_split = [probs[i, :l, :] for i, l in enumerate(logits_lens)]
if decoding_method == "ctc_greedy":
result_transcripts = self._decode_batch_greedy(

@ -0,0 +1,182 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Decoder definition."""
from typing import List
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from typeguard import check_argument_types
from deepspeech.modules.attention import MultiHeadedAttention
from deepspeech.modules.decoder_layer import DecoderLayer
from deepspeech.modules.embedding import PositionalEncoding
from deepspeech.modules.mask import make_non_pad_mask
from deepspeech.modules.mask import subsequent_mask
from deepspeech.modules.positionwise_feed_forward import PositionwiseFeedForward
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["TransformerDecoder"]
class TransformerDecoder(nn.Module):
"""Base class of Transfomer decoder module.
Args:
vocab_size: output dim
encoder_output_size: dimension of attention
attention_heads: the number of heads of multi head attention
linear_units: the hidden units number of position-wise feedforward
num_blocks: the number of decoder blocks
dropout_rate: dropout rate
self_attention_dropout_rate: dropout rate for attention
input_layer: input layer type, `embed`
use_output_layer: whether to use output layer
pos_enc_class: PositionalEncoding module
normalize_before:
True: use layer_norm before each sub-block of a layer.
False: use layer_norm after each sub-block of a layer.
concat_after: whether to concat attention layer's input and output
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
def __init__(
self,
vocab_size: int,
encoder_output_size: int,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
self_attention_dropout_rate: float=0.0,
src_attention_dropout_rate: float=0.0,
input_layer: str="embed",
use_output_layer: bool=True,
normalize_before: bool=True,
concat_after: bool=False, ):
assert check_argument_types()
super().__init__()
attention_dim = encoder_output_size
if input_layer == "embed":
self.embed = nn.Sequential(
nn.Embedding(vocab_size, attention_dim),
PositionalEncoding(attention_dim, positional_dropout_rate), )
else:
raise ValueError(f"only 'embed' is supported: {input_layer}")
self.normalize_before = normalize_before
self.after_norm = nn.LayerNorm(attention_dim, epsilon=1e-12)
self.use_output_layer = use_output_layer
self.output_layer = nn.Linear(attention_dim, vocab_size)
self.decoders = nn.ModuleList([
DecoderLayer(
size=attention_dim,
self_attn=MultiHeadedAttention(attention_heads, attention_dim,
self_attention_dropout_rate),
src_attn=MultiHeadedAttention(attention_heads, attention_dim,
src_attention_dropout_rate),
feed_forward=PositionwiseFeedForward(
attention_dim, linear_units, dropout_rate),
dropout_rate=dropout_rate,
normalize_before=normalize_before,
concat_after=concat_after, ) for _ in range(num_blocks)
])
def forward(
self,
memory: paddle.Tensor,
memory_mask: paddle.Tensor,
ys_in_pad: paddle.Tensor,
ys_in_lens: paddle.Tensor, ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Forward decoder.
Args:
memory: encoded memory, float32 (batch, maxlen_in, feat)
memory_mask: encoder memory mask, (batch, 1, maxlen_in)
ys_in_pad: padded input token ids, int64 (batch, maxlen_out)
ys_in_lens: input lengths of this batch (batch)
Returns:
(tuple): tuple containing:
x: decoded token score before softmax (batch, maxlen_out, vocab_size)
if use_output_layer is True,
olens: (batch, )
"""
tgt = ys_in_pad
# tgt_mask: (B, 1, L)
tgt_mask = (make_non_pad_mask(ys_in_lens).unsqueeze(1))
# m: (1, L, L)
m = subsequent_mask(tgt_mask.size(-1)).unsqueeze(0)
# tgt_mask: (B, L, L)
# TODO(Hui Zhang): not support & for tensor
# tgt_mask = tgt_mask & m
tgt_mask = tgt_mask.logical_and(m)
x, _ = self.embed(tgt)
for layer in self.decoders:
x, tgt_mask, memory, memory_mask = layer(x, tgt_mask, memory,
memory_mask)
if self.normalize_before:
x = self.after_norm(x)
if self.use_output_layer:
x = self.output_layer(x)
# TODO(Hui Zhang): reduce_sum not support bool type
# olens = tgt_mask.sum(1)
olens = tgt_mask.astype(paddle.int).sum(1)
return x, olens
def forward_one_step(
self,
memory: paddle.Tensor,
memory_mask: paddle.Tensor,
tgt: paddle.Tensor,
tgt_mask: paddle.Tensor,
cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, List[paddle.Tensor]]:
"""Forward one step.
This is only used for decoding.
Args:
memory: encoded memory, float32 (batch, maxlen_in, feat)
memory_mask: encoded memory mask, (batch, 1, maxlen_in)
tgt: input token ids, int64 (batch, maxlen_out)
tgt_mask: input token mask, (batch, maxlen_out, maxlen_out)
dtype=paddle.bool
cache: cached output list of (batch, max_time_out-1, size)
Returns:
y, cache: NN output value and cache per `self.decoders`.
`y.shape` is (batch, vocab_size)
"""
x, _ = self.embed(tgt)
new_cache = []
for i, decoder in enumerate(self.decoders):
if cache is None:
c = None
else:
c = cache[i]
x, tgt_mask, memory, memory_mask = decoder(
x, tgt_mask, memory, memory_mask, cache=c)
new_cache.append(x)
if self.normalize_before:
y = self.after_norm(x[:, -1])
else:
y = x[:, -1]
if self.use_output_layer:
y = paddle.log_softmax(self.output_layer(y), axis=-1)
return y, new_cache
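As a quick orientation, here is a minimal smoke test for the decoder above. This is a sketch, assuming the PR's deepspeech package (including its paddle compatibility hacks such as Tensor.size) is importable; the module path and all shapes/values are illustrative only.

# Hypothetical usage sketch for TransformerDecoder (not part of the PR).
import paddle
from deepspeech.modules.decoder import TransformerDecoder  # assumed module path

decoder = TransformerDecoder(vocab_size=100, encoder_output_size=256)
B, T_in, T_out = 2, 50, 10
memory = paddle.randn([B, T_in, 256])                       # encoder output
memory_mask = paddle.ones([B, 1, T_in], dtype=paddle.bool)  # all frames valid
ys_in_pad = paddle.randint(0, 100, [B, T_out])              # shifted target ids
ys_in_lens = paddle.to_tensor([T_out, T_out - 2])
logits, olens = decoder(memory, memory_mask, ys_in_pad, ys_in_lens)
print(logits.shape)  # [2, 10, 100]: token scores before softmax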

@ -0,0 +1,151 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Decoder self-attention layer definition."""
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["DecoderLayer"]
class DecoderLayer(nn.Module):
"""Single decoder layer module.
Args:
size (int): Input dimension.
self_attn (nn.Module): Self-attention module instance.
`MultiHeadedAttention` instance can be used as the argument.
src_attn (nn.Module): Source-attention (encoder-decoder attention) module instance.
`MultiHeadedAttention` instance can be used as the argument.
feed_forward (nn.Module): Feed-forward module instance.
`PositionwiseFeedForward` instance can be used as the argument.
dropout_rate (float): Dropout rate.
normalize_before (bool):
True: use layer_norm before each sub-block.
False: to use layer_norm after each sub-block.
concat_after (bool): Whether to concat attention layer's input
and output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
def __init__(
self,
size: int,
self_attn: nn.Module,
src_attn: nn.Module,
feed_forward: nn.Module,
dropout_rate: float,
normalize_before: bool=True,
concat_after: bool=False, ):
"""Construct an DecoderLayer object."""
super().__init__()
self.size = size
self.self_attn = self_attn
self.src_attn = src_attn
self.feed_forward = feed_forward
self.norm1 = nn.LayerNorm(size, epsilon=1e-12)
self.norm2 = nn.LayerNorm(size, epsilon=1e-12)
self.norm3 = nn.LayerNorm(size, epsilon=1e-12)
self.dropout = nn.Dropout(dropout_rate)
self.normalize_before = normalize_before
self.concat_after = concat_after
self.concat_linear1 = nn.Linear(size + size, size)
self.concat_linear2 = nn.Linear(size + size, size)
def forward(
self,
tgt: paddle.Tensor,
tgt_mask: paddle.Tensor,
memory: paddle.Tensor,
memory_mask: paddle.Tensor,
cache: Optional[paddle.Tensor]=None
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute decoded features.
Args:
tgt (paddle.Tensor): Input tensor (#batch, maxlen_out, size).
tgt_mask (paddle.Tensor): Mask for input tensor
(#batch, maxlen_out).
memory (paddle.Tensor): Encoded memory
(#batch, maxlen_in, size).
memory_mask (paddle.Tensor): Encoded memory mask
(#batch, maxlen_in).
cache (paddle.Tensor): cached tensors.
(#batch, maxlen_out - 1, size).
Returns:
paddle.Tensor: Output tensor (#batch, maxlen_out, size).
paddle.Tensor: Mask for output tensor (#batch, maxlen_out).
paddle.Tensor: Encoded memory (#batch, maxlen_in, size).
paddle.Tensor: Encoded memory mask (#batch, maxlen_in).
"""
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
if cache is None:
tgt_q = tgt
tgt_q_mask = tgt_mask
else:
# compute only the last frame query keeping dim: max_time_out -> 1
assert cache.shape == [
tgt.shape[0],
tgt.shape[1] - 1,
self.size,
], f"{cache.shape} == {[tgt.shape[0], tgt.shape[1] - 1, self.size]}"
tgt_q = tgt[:, -1:, :]
residual = residual[:, -1:, :]
# TODO(Hui Zhang): slice not support bool type
# tgt_q_mask = tgt_mask[:, -1:, :]
tgt_q_mask = tgt_mask.cast(paddle.int64)[:, -1:, :].cast(
paddle.bool)
if self.concat_after:
tgt_concat = paddle.cat(
(tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1)
x = residual + self.concat_linear1(tgt_concat)
else:
x = residual + self.dropout(
self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
if not self.normalize_before:
x = self.norm1(x)
residual = x
if self.normalize_before:
x = self.norm2(x)
if self.concat_after:
x_concat = paddle.cat(
(x, self.src_attn(x, memory, memory, memory_mask)), dim=-1)
x = residual + self.concat_linear2(x_concat)
else:
x = residual + self.dropout(
self.src_attn(x, memory, memory, memory_mask))
if not self.normalize_before:
x = self.norm2(x)
residual = x
if self.normalize_before:
x = self.norm3(x)
x = residual + self.dropout(self.feed_forward(x))
if not self.normalize_before:
x = self.norm3(x)
if cache is not None:
x = paddle.cat([cache, x], dim=1)
return x, tgt_mask, memory, memory_mask
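The cache path above computes only the newest query frame and then re-attaches the cached prefix. A minimal sketch of that shape bookkeeping with plain paddle ops (dummy tensors, no attention involved; illustrative only):

# Shape bookkeeping of the DecoderLayer cache (illustrative only).
import paddle

B, T, D = 2, 5, 8
x = paddle.randn([B, T, D])           # decoder input at the current step
cache = paddle.randn([B, T - 1, D])   # this layer's output from the previous step
x_q = x[:, -1:, :]                    # only the last frame is used as the query
out = paddle.concat([cache, x_q], axis=1)  # cached prefix is re-attached at the end
print(out.shape)                      # [2, 5, 8] == (B, T, D)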

@ -12,23 +12,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Positonal Encoding Module."""
import math
import logging
import numpy as np
from typing import Tuple
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
logger = logging.getLogger(__name__)
from deepspeech.utils.log import Log
__all__ = ["PositionalEncoding", "RelPositionalEncoding"]
logger = Log(__name__).getlog()
# TODO(Hui Zhang): remove this hack
paddle.float32 = 'float32'
__all__ = ["PositionalEncoding", "RelPositionalEncoding"]
class PositionalEncoding(nn.Layer):
@ -51,10 +45,10 @@ class PositionalEncoding(nn.Layer):
self.max_len = max_len
self.xscale = paddle.to_tensor(math.sqrt(self.d_model))
self.dropout = nn.Dropout(p=dropout_rate)
self.pe = paddle.zeros(self.max_len, self.d_model) #[T,D]
self.pe = paddle.zeros([self.max_len, self.d_model]) #[T,D]
position = paddle.arange(
0, self.max_len, dtype=paddle.float32).unsqueeze(1)
0, self.max_len, dtype=paddle.float32).unsqueeze(1) #[T, 1]
div_term = paddle.exp(
paddle.arange(0, self.d_model, 2, dtype=paddle.float32) *
-(math.log(10000.0) / self.d_model))
@ -71,13 +65,11 @@ class PositionalEncoding(nn.Layer):
offset (int): position offset
Returns:
paddle.Tensor: Encoded tensor. Its shape is (batch, time, ...)
paddle.Tensor: for compatibility to RelPositionalEncoding
paddle.Tensor: for compatibility to RelPositionalEncoding, (batch=1, time, ...)
"""
T = paddle.shape(x)[1]
assert offset + T < self.max_len
#assert offset + x.size(1) < self.max_len
#self.pe = self.pe.to(x.device)
#pos_emb = self.pe[:, offset:offset + x.size(1)]
T = x.shape[1]
assert offset + x.size(1) < self.max_len
#TODO(Hui Zhang): using T = x.size(1), __getitem__ not support Tensor
pos_emb = self.pe[:, offset:offset + T]
x = x * self.xscale + pos_emb
return self.dropout(x), self.dropout(pos_emb)
@ -122,11 +114,8 @@ class RelPositionalEncoding(PositionalEncoding):
paddle.Tensor: Encoded tensor (batch, time, `*`).
paddle.Tensor: Positional embedding tensor (1, time, `*`).
"""
T = paddle.shape(x)[1]
assert offset + T < self.max_len
#assert offset + x.size(1) < self.max_len
#self.pe = self.pe.to(x.device)
assert offset + x.size(1) < self.max_len
x = x * self.xscale
#pos_emb = self.pe[:, offset:offset + x.size(1)]
pos_emb = self.pe[:, offset:offset + T]
#TODO(Hui Zhang): using x.size(1), __getitem__ not support Tensor
pos_emb = self.pe[:, offset:offset + x.shape[1]]
return self.dropout(x), self.dropout(pos_emb)
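For reference, the sinusoidal table that PositionalEncoding builds in __init__ follows the standard Transformer recipe. Below is a self-contained re-derivation with plain paddle ops; it interleaves even/odd dims via stack instead of strided slice assignment, purely so the sketch stays portable (illustrative only, not the PR's exact code):

# Sinusoidal positional-encoding table (illustrative re-derivation).
import math
import paddle

max_len, d_model = 8, 4
position = paddle.arange(0, max_len, dtype=paddle.float32).unsqueeze(1)  # [T, 1]
div_term = paddle.exp(
    paddle.arange(0, d_model, 2, dtype=paddle.float32) *
    -(math.log(10000.0) / d_model))                                      # [D/2]
sin = paddle.sin(position * div_term)   # goes to even dims
cos = paddle.cos(position * div_term)   # goes to odd dims
pe = paddle.stack([sin, cos], axis=-1).reshape([max_len, d_model])       # [T, D]
pe = pe.unsqueeze(0)                                                     # [1, T, D]
print(pe.shape)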

@ -0,0 +1,448 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Encoder definition."""
from typing import List
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from typeguard import check_argument_types
from deepspeech.modules.activation import get_activation
from deepspeech.modules.attention import MultiHeadedAttention
from deepspeech.modules.attention import RelPositionMultiHeadedAttention
from deepspeech.modules.conformer_convolution import ConvolutionModule
from deepspeech.modules.embedding import PositionalEncoding
from deepspeech.modules.embedding import RelPositionalEncoding
from deepspeech.modules.encoder_layer import ConformerEncoderLayer
from deepspeech.modules.encoder_layer import TransformerEncoderLayer
from deepspeech.modules.mask import add_optional_chunk_mask
from deepspeech.modules.mask import make_non_pad_mask
from deepspeech.modules.positionwise_feed_forward import PositionwiseFeedForward
from deepspeech.modules.subsampling import Conv2dSubsampling4
from deepspeech.modules.subsampling import Conv2dSubsampling6
from deepspeech.modules.subsampling import Conv2dSubsampling8
from deepspeech.modules.subsampling import LinearNoSubsampling
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["BaseEncoder", 'TransformerEncoder', "ConformerEncoder"]
class BaseEncoder(nn.Layer):
def __init__(
self,
input_size: int,
output_size: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
pos_enc_layer_type: str="abs_pos",
normalize_before: bool=True,
concat_after: bool=False,
static_chunk_size: int=0,
use_dynamic_chunk: bool=False,
global_cmvn: paddle.nn.Layer=None,
use_dynamic_left_chunk: bool=False, ):
"""
Args:
input_size (int): input dim, d_feature
output_size (int): dimension of attention, d_model
attention_heads (int): the number of heads of multi head attention
linear_units (int): the hidden units number of position-wise feed
forward
num_blocks (int): the number of encoder blocks
dropout_rate (float): dropout rate
attention_dropout_rate (float): dropout rate in attention
positional_dropout_rate (float): dropout rate after adding
positional encoding
input_layer (str): input layer type.
optional [linear, conv2d, conv2d6, conv2d8]
pos_enc_layer_type (str): Encoder positional encoding layer type.
optional [abs_pos, scaled_abs_pos, rel_pos]
normalize_before (bool):
True: use layer_norm before each sub-block of a layer.
False: use layer_norm after each sub-block of a layer.
concat_after (bool): whether to concat attention layer's input
and output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
static_chunk_size (int): chunk size for static chunk training and
decoding
use_dynamic_chunk (bool): whether to use dynamic chunk size for
training or not. You can only use a fixed chunk (chunk_size > 0)
or a dynamic chunk size (use_dynamic_chunk = True)
global_cmvn (Optional[paddle.nn.Layer]): Optional GlobalCMVN layer
use_dynamic_left_chunk (bool): whether to use dynamic left chunk in
dynamic chunk training
"""
assert check_argument_types()
super().__init__()
self._output_size = output_size
if pos_enc_layer_type == "abs_pos":
pos_enc_class = PositionalEncoding
elif pos_enc_layer_type == "rel_pos":
pos_enc_class = RelPositionalEncoding
else:
raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
if input_layer == "linear":
subsampling_class = LinearNoSubsampling
elif input_layer == "conv2d":
subsampling_class = Conv2dSubsampling4
elif input_layer == "conv2d6":
subsampling_class = Conv2dSubsampling6
elif input_layer == "conv2d8":
subsampling_class = Conv2dSubsampling8
else:
raise ValueError("unknown input_layer: " + input_layer)
self.global_cmvn = global_cmvn
self.embed = subsampling_class(
idim=input_size,
odim=output_size,
dropout_rate=dropout_rate,
pos_enc_class=pos_enc_class(
d_model=output_size, dropout_rate=positional_dropout_rate), )
self.normalize_before = normalize_before
self.after_norm = nn.LayerNorm(output_size, epsilon=1e-12)
self.static_chunk_size = static_chunk_size
self.use_dynamic_chunk = use_dynamic_chunk
self.use_dynamic_left_chunk = use_dynamic_left_chunk
def output_size(self) -> int:
return self._output_size
def forward(
self,
xs: paddle.Tensor,
xs_lens: paddle.Tensor,
decoding_chunk_size: int=0,
num_decoding_left_chunks: int=-1,
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Embed positions in tensor.
Args:
xs: padded input tensor (B, L, D)
xs_lens: input length (B)
decoding_chunk_size: decoding chunk size for dynamic chunk
0: default for training, use random dynamic chunk.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
num_decoding_left_chunks: number of left chunks, this is for decoding,
the chunk size is decoding_chunk_size.
>=0: use num_decoding_left_chunks
<0: use all left chunks
Returns:
encoder output tensor, lens and mask
"""
masks = make_non_pad_mask(xs_lens).unsqueeze(1) # (B, 1, L)
if self.global_cmvn is not None:
xs = self.global_cmvn(xs)
#TODO(Hui Zhang): self.embed(xs, masks, offset=0), stride_slice not support bool tensor
xs, pos_emb, masks = self.embed(xs, masks.type_as(xs), offset=0)
#TODO(Hui Zhang): remove mask.astype, stride_slice not support bool tensor
masks = masks.astype(paddle.bool)
#TODO(Hui Zhang): mask_pad = ~masks
mask_pad = masks.logical_not()
chunk_masks = add_optional_chunk_mask(
xs, masks, self.use_dynamic_chunk, self.use_dynamic_left_chunk,
decoding_chunk_size, self.static_chunk_size,
num_decoding_left_chunks)
for layer in self.encoders:
xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
if self.normalize_before:
xs = self.after_norm(xs)
# Here we assume the mask is not changed in encoder layers, so just
# return the masks before encoder layers, and the masks will be used
# for cross attention with decoder later
return xs, masks
def forward_chunk(
self,
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None,
elayers_output_cache: Optional[List[paddle.Tensor]]=None,
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
""" Forward just one chunk
Args:
xs (paddle.Tensor): chunk input, [B=1, T, D]
offset (int): current offset in encoder output time stamp
required_cache_size (int): cache size required for next chunk
computation
>=0: actual cache size
<0: means all history cache is required
subsampling_cache (Optional[paddle.Tensor]): subsampling cache
elayers_output_cache (Optional[List[paddle.Tensor]]):
transformer/conformer encoder layers output cache
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
cnn cache
Returns:
paddle.Tensor: output of current input xs
paddle.Tensor: subsampling cache required for next chunk computation
List[paddle.Tensor]: encoder layers output cache required for next
chunk computation
List[paddle.Tensor]: conformer cnn cache
"""
assert xs.size(0) == 1 # batch size must be one
# tmp_masks is just for interface compatibility
tmp_masks = paddle.ones([1, xs.size(1)], dtype=paddle.bool)
tmp_masks = tmp_masks.unsqueeze(1) #[B=1, C=1, T]
if self.global_cmvn is not None:
xs = self.global_cmvn(xs)
xs, pos_emb, _ = self.embed(
xs, tmp_masks, offset=offset) #xs=(B, T, D), pos_emb=(B=1, T, D)
if subsampling_cache is not None:
cache_size = subsampling_cache.size(1) #T
xs = paddle.cat((subsampling_cache, xs), dim=1)
else:
cache_size = 0
pos_emb = self.embed.position_encoding(
offset=offset - cache_size, size=xs.size(1))
if required_cache_size < 0:
next_cache_start = 0
elif required_cache_size == 0:
next_cache_start = xs.size(1)
else:
next_cache_start = xs.size(1) - required_cache_size
r_subsampling_cache = xs[:, next_cache_start:, :]
# Real mask for transformer/conformer layers
masks = paddle.ones([1, xs.size(1)], dtype=paddle.bool)
masks = masks.unsqueeze(1) #[B=1, C=1, T]
r_elayers_output_cache = []
r_conformer_cnn_cache = []
for i, layer in enumerate(self.encoders):
attn_cache = None if elayers_output_cache is None else elayers_output_cache[
i]
cnn_cache = None if conformer_cnn_cache is None else conformer_cnn_cache[
i]
xs, _, new_cnn_cache = layer(
xs,
masks,
pos_emb,
output_cache=attn_cache,
cnn_cache=cnn_cache)
r_elayers_output_cache.append(xs[:, next_cache_start:, :])
r_conformer_cnn_cache.append(new_cnn_cache)
if self.normalize_before:
xs = self.after_norm(xs)
return (xs[:, cache_size:, :], r_subsampling_cache,
r_elayers_output_cache, r_conformer_cnn_cache)
def forward_chunk_by_chunk(
self,
xs: paddle.Tensor,
decoding_chunk_size: int,
num_decoding_left_chunks: int=-1,
) -> Tuple[paddle.Tensor, paddle.Tensor]:
""" Forward input chunk by chunk with chunk_size like a streaming
fashion
Here we should pay special attention to computation cache in the
streaming style forward chunk by chunk. Three things should be taken
into account for computation in the current network:
1. transformer/conformer encoder layers output cache
2. convolution in conformer
3. convolution in subsampling
However, we don't implement a subsampling cache, because:
1. We can make the subsampling module output the right result by
overlapping the input instead of caching left context. This wastes
some computation, but subsampling accounts for only a very small
fraction of the whole model's computation.
2. Typically, the subsampling module stacks several convolution
layers with different subsampling rates; caching across different
convolution layers with different rates is tricky and complicated.
3. Currently, nn.Sequential is used to stack all the convolution
layers in subsampling; we would need to rewrite it to make it work
with a cache, which is not preferred.
Args:
xs (paddle.Tensor): (1, max_len, dim)
decoding_chunk_size (int): decoding chunk size.
num_decoding_left_chunks (int): decoding with this many left chunks.
"""
assert decoding_chunk_size > 0
# The model is trained by static or dynamic chunk
assert self.static_chunk_size > 0 or self.use_dynamic_chunk
# feature stride and window for `subsampling` module
subsampling = self.embed.subsampling_rate
context = self.embed.right_context + 1 # Add current frame
stride = subsampling * decoding_chunk_size
decoding_window = (decoding_chunk_size - 1) * subsampling + context
num_frames = xs.size(1)
required_cache_size = decoding_chunk_size * num_decoding_left_chunks
subsampling_cache: Optional[paddle.Tensor] = None
elayers_output_cache: Optional[List[paddle.Tensor]] = None
conformer_cnn_cache: Optional[List[paddle.Tensor]] = None
outputs = []
offset = 0
# Feed forward overlap input step by step
for cur in range(0, num_frames - context + 1, stride):
end = min(cur + decoding_window, num_frames)
chunk_xs = xs[:, cur:end, :]
(y, subsampling_cache, elayers_output_cache,
conformer_cnn_cache) = self.forward_chunk(
chunk_xs, offset, required_cache_size, subsampling_cache,
elayers_output_cache, conformer_cnn_cache)
outputs.append(y)
offset += y.size(1)
ys = paddle.cat(outputs, 1)
# fake mask, just for jit script and compatibility with `forward` api
masks = paddle.ones([1, ys.size(1)], dtype=paddle.bool)
masks = masks.unsqueeze(1)
return ys, masks
class TransformerEncoder(BaseEncoder):
"""Transformer encoder module."""
def __init__(
self,
input_size: int,
output_size: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
pos_enc_layer_type: str="abs_pos",
normalize_before: bool=True,
concat_after: bool=False,
static_chunk_size: int=0,
use_dynamic_chunk: bool=False,
global_cmvn: nn.Layer=None,
use_dynamic_left_chunk: bool=False, ):
""" Construct TransformerEncoder
See Encoder for the meaning of each parameter.
"""
assert check_argument_types()
super().__init__(input_size, output_size, attention_heads, linear_units,
num_blocks, dropout_rate, positional_dropout_rate,
attention_dropout_rate, input_layer,
pos_enc_layer_type, normalize_before, concat_after,
static_chunk_size, use_dynamic_chunk, global_cmvn,
use_dynamic_left_chunk)
self.encoders = nn.ModuleList([
TransformerEncoderLayer(
size=output_size,
self_attn=MultiHeadedAttention(attention_heads, output_size,
attention_dropout_rate),
feed_forward=PositionwiseFeedForward(output_size, linear_units,
dropout_rate),
dropout_rate=dropout_rate,
normalize_before=normalize_before,
concat_after=concat_after) for _ in range(num_blocks)
])
class ConformerEncoder(BaseEncoder):
"""Conformer encoder module."""
def __init__(
self,
input_size: int,
output_size: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
pos_enc_layer_type: str="rel_pos",
normalize_before: bool=True,
concat_after: bool=False,
static_chunk_size: int=0,
use_dynamic_chunk: bool=False,
global_cmvn: nn.Layer=None,
use_dynamic_left_chunk: bool=False,
positionwise_conv_kernel_size: int=1,
macaron_style: bool=True,
selfattention_layer_type: str="rel_selfattn",
activation_type: str="swish",
use_cnn_module: bool=True,
cnn_module_kernel: int=15,
causal: bool=False,
cnn_module_norm: str="batch_norm", ):
"""Construct ConformerEncoder
Args:
input_size to use_dynamic_chunk: see BaseEncoder
positionwise_conv_kernel_size (int): Kernel size of positionwise
conv1d layer.
macaron_style (bool): Whether to use macaron style for
positionwise layer.
selfattention_layer_type (str): Encoder attention layer type;
the parameter has no effect for now and exists only for
configuration compatibility.
activation_type (str): Encoder activation function type.
use_cnn_module (bool): Whether to use convolution module.
cnn_module_kernel (int): Kernel size of convolution module.
causal (bool): whether to use causal convolution or not.
cnn_module_norm (str): cnn conv norm type, Optional['batch_norm','layer_norm']
"""
assert check_argument_types()
super().__init__(input_size, output_size, attention_heads, linear_units,
num_blocks, dropout_rate, positional_dropout_rate,
attention_dropout_rate, input_layer,
pos_enc_layer_type, normalize_before, concat_after,
static_chunk_size, use_dynamic_chunk, global_cmvn,
use_dynamic_left_chunk)
activation = get_activation(activation_type)
# self-attention module definition
encoder_selfattn_layer = RelPositionMultiHeadedAttention
encoder_selfattn_layer_args = (attention_heads, output_size,
attention_dropout_rate)
# feed-forward module definition
positionwise_layer = PositionwiseFeedForward
positionwise_layer_args = (output_size, linear_units, dropout_rate,
activation)
# convolution module definition
convolution_layer = ConvolutionModule
convolution_layer_args = (output_size, cnn_module_kernel, activation,
cnn_module_norm, causal)
self.encoders = nn.ModuleList([
ConformerEncoderLayer(
size=output_size,
self_attn=encoder_selfattn_layer(*encoder_selfattn_layer_args),
feed_forward=positionwise_layer(*positionwise_layer_args),
feed_forward_macaron=positionwise_layer(
*positionwise_layer_args) if macaron_style else None,
conv_module=convolution_layer(*convolution_layer_args)
if use_cnn_module else None,
dropout_rate=dropout_rate,
normalize_before=normalize_before,
concat_after=concat_after) for _ in range(num_blocks)
])
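A minimal smoke test for the encoders above; again a sketch assuming the PR's package layout (deepspeech.modules.encoder) and its paddle hacks, with illustrative shapes:

# Hypothetical usage sketch for ConformerEncoder (not part of the PR).
import paddle
from deepspeech.modules.encoder import ConformerEncoder  # assumed module path

encoder = ConformerEncoder(input_size=80, output_size=256, num_blocks=2)
B, T, D = 2, 100, 80
xs = paddle.randn([B, T, D])             # e.g. fbank features
xs_lens = paddle.to_tensor([100, 80])    # valid frame counts per utterance
ys, masks = encoder(xs, xs_lens)
# conv2d subsampling shrinks time by ~4: ((100 - 1) // 2 - 1) // 2 = 24
print(ys.shape)   # [2, 24, 256]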

@ -0,0 +1,284 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Encoder self-attention layer definition."""
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["TransformerEncoderLayer", "ConformerEncoderLayer"]
class TransformerEncoderLayer(nn.Layer):
"""Encoder layer module."""
def __init__(
self,
size: int,
self_attn: nn.Layer,
feed_forward: nn.Layer,
dropout_rate: float,
normalize_before: bool=True,
concat_after: bool=False, ):
"""Construct an EncoderLayer object.
Args:
size (int): Input dimension.
self_attn (nn.Layer): Self-attention module instance.
`MultiHeadedAttention` or `RelPositionMultiHeadedAttention`
instance can be used as the argument.
feed_forward (nn.Layer): Feed-forward module instance.
`PositionwiseFeedForward`, instance can be used as the argument.
dropout_rate (float): Dropout rate.
normalize_before (bool):
True: use layer_norm before each sub-block.
False: to use layer_norm after each sub-block.
concat_after (bool): Whether to concat attention layer's input and
output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
super().__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.norm1 = nn.LayerNorm(size, epsilon=1e-12)
self.norm2 = nn.LayerNorm(size, epsilon=1e-12)
self.dropout = nn.Dropout(dropout_rate)
self.size = size
self.normalize_before = normalize_before
self.concat_after = concat_after
# concat_linear may not be used in the forward function,
# but it will still be saved in the checkpoint
self.concat_linear = nn.Linear(size + size, size)
def forward(
self,
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
output_cache: Optional[paddle.Tensor]=None,
cnn_cache: Optional[paddle.Tensor]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
x (paddle.Tensor): Input tensor (#batch, time, size).
mask (paddle.Tensor): Mask tensor for the input (#batch, time).
pos_emb (paddle.Tensor): just for interface compatibility
to ConformerEncoderLayer
mask_pad (paddle.Tensor): not used in the transformer layer,
just for a unified api with the conformer.
output_cache (paddle.Tensor): Cache tensor of the output
(#batch, time2, size), time2 < time in x.
cnn_cache (paddle.Tensor): not used here, it's for interface
compatibility to ConformerEncoderLayer
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time).
paddle.Tensor: Fake cnn cache tensor for api compatibility with Conformer (#batch, channels, time').
"""
residual = x
if self.normalize_before:
x = self.norm1(x)
if output_cache is None:
x_q = x
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
if self.concat_after:
x_concat = paddle.concat(
(x, self.self_attn(x_q, x, x, mask)), axis=-1)
x = residual + self.concat_linear(x_concat)
else:
x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
if not self.normalize_before:
x = self.norm1(x)
residual = x
if self.normalize_before:
x = self.norm2(x)
x = residual + self.dropout(self.feed_forward(x))
if not self.normalize_before:
x = self.norm2(x)
if output_cache is not None:
x = paddle.concat([output_cache, x], axis=1)
fake_cnn_cache = paddle.zeros([1], dtype=x.dtype)
return x, mask, fake_cnn_cache
class ConformerEncoderLayer(nn.Layer):
"""Encoder layer module."""
def __init__(
self,
size: int,
self_attn: nn.Layer,
feed_forward: Optional[nn.Layer]=None,
feed_forward_macaron: Optional[nn.Layer]=None,
conv_module: Optional[nn.Layer]=None,
dropout_rate: float=0.1,
normalize_before: bool=True,
concat_after: bool=False, ):
"""Construct an EncoderLayer object.
Args:
size (int): Input dimension.
self_attn (nn.Layer): Self-attention module instance.
`MultiHeadedAttention` or `RelPositionMultiHeadedAttention`
instance can be used as the argument.
feed_forward (nn.Layer): Feed-forward module instance.
`PositionwiseFeedForward` instance can be used as the argument.
feed_forward_macaron (nn.Layer): Additional feed-forward module
instance.
`PositionwiseFeedForward` instance can be used as the argument.
conv_module (nn.Layer): Convolution module instance.
`ConvolutionModule` instance can be used as the argument.
dropout_rate (float): Dropout rate.
normalize_before (bool):
True: use layer_norm before each sub-block.
False: use layer_norm after each sub-block.
concat_after (bool): Whether to concat attention layer's input and
output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
super().__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.feed_forward_macaron = feed_forward_macaron
self.conv_module = conv_module
self.norm_ff = nn.LayerNorm(size, epsilon=1e-12) # for the FNN module
self.norm_mha = nn.LayerNorm(size, epsilon=1e-12) # for the MHA module
if feed_forward_macaron is not None:
self.norm_ff_macaron = nn.LayerNorm(size, epsilon=1e-12)
self.ff_scale = 0.5
else:
self.ff_scale = 1.0
if self.conv_module is not None:
self.norm_conv = nn.LayerNorm(
size, epsilon=1e-12) # for the CNN module
self.norm_final = nn.LayerNorm(
size, epsilon=1e-12) # for the final output of the block
self.dropout = nn.Dropout(dropout_rate)
self.size = size
self.normalize_before = normalize_before
self.concat_after = concat_after
self.concat_linear = nn.Linear(size + size, size)
def forward(
self,
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
output_cache: Optional[paddle.Tensor]=None,
cnn_cache: Optional[paddle.Tensor]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
x (paddle.Tensor): (#batch, time, size)
mask (paddle.Tensor): Mask tensor for the input (#batch, time).
pos_emb (paddle.Tensor): positional encoding, must not be None
for ConformerEncoderLayer.
mask_pad (paddle.Tensor): batch padding mask used for conv module, (B, 1, T).
output_cache (paddle.Tensor): Cache tensor of the encoder output
(#batch, time2, size), time2 < time in x.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time).
paddle.Tensor: New cnn cache tensor (#batch, channels, time').
"""
# whether to use macaron style FFN
if self.feed_forward_macaron is not None:
residual = x
if self.normalize_before:
x = self.norm_ff_macaron(x)
x = residual + self.ff_scale * self.dropout(
self.feed_forward_macaron(x))
if not self.normalize_before:
x = self.norm_ff_macaron(x)
# multi-headed self-attention module
residual = x
if self.normalize_before:
x = self.norm_mha(x)
if output_cache is None:
x_q = x
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
x_att = self.self_attn(x_q, x, x, pos_emb, mask)
if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1)
x = residual + self.concat_linear(x_concat)
else:
x = residual + self.dropout(x_att)
if not self.normalize_before:
x = self.norm_mha(x)
# convolution module
# Fake new cnn cache here, and then change it in conv_module
new_cnn_cache = paddle.zeros([1], dtype=x.dtype)
if self.conv_module is not None:
residual = x
if self.normalize_before:
x = self.norm_conv(x)
x, new_cnn_cache = self.conv_module(x, mask_pad, cnn_cache)
x = residual + self.dropout(x)
if not self.normalize_before:
x = self.norm_conv(x)
# feed forward module
residual = x
if self.normalize_before:
x = self.norm_ff(x)
x = residual + self.ff_scale * self.dropout(self.feed_forward(x))
if not self.normalize_before:
x = self.norm_ff(x)
if self.conv_module is not None:
x = self.norm_final(x)
if output_cache is not None:
x = paddle.concat([output_cache, x], axis=1)
return x, mask, new_cnn_cache
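The normalize_before flag used throughout these layers is the usual pre-norm vs post-norm choice. A compressed sketch of the two residual orderings for a single sub-block, with a stand-in sublayer (illustrative only):

# Pre-norm vs post-norm residual orderings (illustrative only).
import paddle
from paddle import nn

D = 8
norm = nn.LayerNorm(D, epsilon=1e-12)
sublayer = nn.Linear(D, D)           # stand-in for attention / feed-forward
x = paddle.randn([2, 5, D])

pre_norm = x + sublayer(norm(x))     # normalize_before=True: norm inside the residual
post_norm = norm(x + sublayer(x))    # normalize_before=False: norm after the add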

@ -11,45 +11,15 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
logger = logging.getLogger(__name__)
__all__ = ['CTCLoss']
# TODO(Hui Zhang): remove this hack, when `norm_by_times=True` is added
def ctc_loss(logits,
labels,
input_lengths,
label_lengths,
blank=0,
reduction='mean',
norm_by_times=True):
#logger.info("my ctc loss with norm by times")
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times,
input_lengths, label_lengths)
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ")
assert reduction in ['mean', 'sum', 'none']
if reduction == 'mean':
loss_out = paddle.mean(loss_out / label_lengths)
elif reduction == 'sum':
loss_out = paddle.sum(loss_out)
logger.info(f"ctc loss: {loss_out}")
return loss_out
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
# TODO(Hui Zhang): remove this hack
F.ctc_loss = ctc_loss
__all__ = ['CTCLoss', "LabelSmoothingLoss"]
class CTCLoss(nn.Layer):
@ -76,8 +46,98 @@ class CTCLoss(nn.Layer):
# warp-ctc need activation with shape [T, B, V + 1]
# logits: (B, L, D) -> (L, B, D)
logits = logits.transpose([1, 0, 2])
# (TODO:Hui Zhang) ctc loss does not support int64 labels
ys_pad = ys_pad.astype(paddle.int32)
loss = self.loss(logits, ys_pad, hlens, ys_lens)
if self.batch_average:
# Batch-size average
loss = loss / B
return loss
class LabelSmoothingLoss(nn.Layer):
"""Label-smoothing loss.
In a standard CE loss, the label's data distribution is:
[0,1,2] ->
[
[1.0, 0.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, 1.0],
]
In the label-smoothed CE loss, some probability mass
is taken from the true label's probability (1.0) and divided
among the other labels.
e.g.
smoothing=0.1
[0,1,2] ->
[
[0.9, 0.05, 0.05],
[0.05, 0.9, 0.05],
[0.05, 0.05, 0.9],
]
"""
def __init__(self,
size: int,
padding_idx: int,
smoothing: float,
normalize_length: bool=False):
"""Label-smoothing loss.
Args:
size (int): the number of class
padding_idx (int): padding class id which will be ignored for loss
smoothing (float): smoothing rate (0.0 means the conventional CE)
normalize_length (bool):
True, normalize loss by sequence length;
False, normalize loss by batch size.
Defaults to False.
"""
super().__init__()
self.size = size
self.padding_idx = padding_idx
self.smoothing = smoothing
self.confidence = 1.0 - smoothing
self.normalize_length = normalize_length
self.criterion = nn.KLDivLoss(reduction="none")
def forward(self, x: paddle.Tensor, target: paddle.Tensor) -> paddle.Tensor:
"""Compute loss between x and target.
The model outputs and data labels tensors are flatten to
(batch*seqlen, class) shape and a mask is applied to the
padding part which should not be calculated for loss.
Args:
x (paddle.Tensor): prediction (batch, seqlen, class)
target (paddle.Tensor):
target signal masked with self.padding_id (batch, seqlen)
Returns:
loss (paddle.Tensor) : The KL loss, scalar float value
"""
B, T, D = paddle.shape(x)
assert D == self.size
x = x.reshape((-1, self.size))
target = target.reshape([-1])
# use zeros_like instead of torch.no_grad() for true_dist,
# since no_grad() can not be exported by JIT
true_dist = paddle.full_like(x, self.smoothing / (self.size - 1))
ignore = target == self.padding_idx # (B,)
# target = target * (1 - ignore) # avoid -1 index
target = target.masked_fill(ignore, 0) # avoid -1 index
# true_dist.scatter_(1, target.unsqueeze(1), self.confidence)
target_mask = F.one_hot(target, self.size)
true_dist *= (1 - target_mask)
true_dist += target_mask * self.confidence
kl = self.criterion(F.log_softmax(x, axis=1), true_dist)
#TODO(Hui Zhang): sum not support bool type
#total = len(target) - int(ignore.sum())
total = len(target) - int(ignore.type_as(target).sum())
denom = total if self.normalize_length else B
#numer = (kl * (1 - ignore)).sum()
numer = kl.masked_fill(ignore.unsqueeze(1), 0).sum()
return numer / denom
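To make the smoothing concrete, the target distribution the loss builds for one toy batch can be reproduced with plain paddle ops. Note the implementation spreads mass as smoothing / (size - 1), which is exactly the docstring's 0.05 for size=3, smoothing=0.1 (a sketch, not the PR's code path):

# Smoothed target distribution of LabelSmoothingLoss (illustrative only).
import paddle
import paddle.nn.functional as F

size, smoothing = 3, 0.1
confidence = 1.0 - smoothing
target = paddle.to_tensor([0, 1, 2])
true_dist = paddle.full([3, size], smoothing / (size - 1))  # off-target mass
one_hot = F.one_hot(target, size)
true_dist = true_dist * (1 - one_hot) + one_hot * confidence
print(true_dist.numpy())
# [[0.9  0.05 0.05]
#  [0.05 0.9  0.05]
#  [0.05 0.05 0.9 ]]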

@ -11,20 +11,37 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
logger = logging.getLogger(__name__)
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['sequence_mask']
__all__ = [
'sequence_mask', "make_pad_mask", "make_non_pad_mask", "subsequent_mask",
"subsequent_chunk_mask", "add_optional_chunk_mask", "mask_finished_scores",
"mask_finished_preds"
]
def sequence_mask(x_len, max_len=None, dtype='float32'):
"""batch sequence mask.
Args:
x_len (paddle.Tensor): sequence lengths, [B]
max_len (int, optional): max sequence length. Defaults to None.
dtype (str, optional): mask data type. Defaults to 'float32'.
Returns:
paddle.Tensor: [B, Tmax]
Examples:
>>> sequence_mask([2, 4])
[[1., 1., 0., 0.],
[1., 1., 1., 1.]]
"""
# (TODO: Hui Zhang): jit not support Tensor.dim() and Tensor.ndim
# assert x_len.dim() == 1, (x_len.dim(), x_len)
max_len = max_len or x_len.max()
x_len = paddle.unsqueeze(x_len, -1)
row_vector = paddle.arange(max_len)
@ -33,3 +50,236 @@ def sequence_mask(x_len, max_len=None, dtype='float32'):
mask = row_vector > x_len  # a bug: this went wrong during broadcasting
mask = paddle.cast(mask, dtype)
return mask
def make_pad_mask(lengths: paddle.Tensor) -> paddle.Tensor:
"""Make mask tensor containing indices of padded part.
See description of make_non_pad_mask.
Args:
lengths (paddle.Tensor): Batch of lengths (B,).
Returns:
paddle.Tensor: Mask tensor containing indices of padded part.
Examples:
>>> lengths = [5, 3, 2]
>>> make_pad_mask(lengths)
masks = [[0, 0, 0, 0 ,0],
[0, 0, 0, 1, 1],
[0, 0, 1, 1, 1]]
"""
assert lengths.dim() == 1
batch_size = int(lengths.shape[0])
max_len = int(lengths.max())
seq_range = paddle.arange(0, max_len, dtype=paddle.int64)
seq_range_expand = seq_range.unsqueeze(0).expand([batch_size, max_len])
seq_length_expand = lengths.unsqueeze(-1)
mask = seq_range_expand >= seq_length_expand
return mask
def make_non_pad_mask(lengths: paddle.Tensor) -> paddle.Tensor:
"""Make mask tensor containing indices of non-padded part.
The sequences in a batch may have different lengths. To enable
batch computing, padding is needed to make all sequences the same
size. To keep the padding part from passing values into
context-dependent blocks such as attention or convolution, this
padding part is masked.
This pad_mask is used in both encoder and decoder.
1 for non-padded part and 0 for padded part.
Args:
lengths (paddle.Tensor): Batch of lengths (B,).
Returns:
paddle.Tensor: mask tensor containing indices of non-padded part.
Examples:
>>> lengths = [5, 3, 2]
>>> make_non_pad_mask(lengths)
masks = [[1, 1, 1, 1 ,1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0]]
"""
#TODO(Hui Zhang): return ~make_pad_mask(lengths), not support ~
return make_pad_mask(lengths).logical_not()
def subsequent_mask(size: int) -> paddle.Tensor:
"""Create mask for subsequent steps (size, size).
This mask is used only in the decoder, which works in an
auto-regressive mode: the current step may only attend to its left
steps. In the encoder, full attention is used when streaming is not
necessary and the sequence is not long; in that case, no attention
mask is needed. When streaming is needed, chunk-based attention is
used in the encoder; see subsequent_chunk_mask for the chunk-based
attention mask.
Args:
size (int): size of mask
Returns:
paddle.Tensor: mask, [size, size]
Examples:
>>> subsequent_mask(3)
[[1, 0, 0],
[1, 1, 0],
[1, 1, 1]]
"""
ret = paddle.ones([size, size], dtype=paddle.bool)
#TODO(Hui Zhang): tril not support bool
#return paddle.tril(ret)
ret = ret.astype(paddle.float)
ret = paddle.tril(ret)
ret = ret.astype(paddle.bool)
return ret
def subsequent_chunk_mask(
size: int,
chunk_size: int,
num_left_chunks: int=-1, ) -> paddle.Tensor:
"""Create mask for subsequent steps (size, size) with chunk size,
this is for streaming encoder
Args:
size (int): size of mask
chunk_size (int): size of chunk
num_left_chunks (int): number of left chunks
<0: use full chunk
>=0: use num_left_chunks
Returns:
paddle.Tensor: mask, [size, size]
Examples:
>>> subsequent_chunk_mask(4, 2)
[[1, 1, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 1],
[1, 1, 1, 1]]
"""
ret = paddle.zeros([size, size], dtype=paddle.bool)
for i in range(size):
if num_left_chunks < 0:
start = 0
else:
start = max(0, (i // chunk_size - num_left_chunks) * chunk_size)
ending = min(size, (i // chunk_size + 1) * chunk_size)
ret[i, start:ending] = True
return ret
def add_optional_chunk_mask(xs: paddle.Tensor,
masks: paddle.Tensor,
use_dynamic_chunk: bool,
use_dynamic_left_chunk: bool,
decoding_chunk_size: int,
static_chunk_size: int,
num_decoding_left_chunks: int):
""" Apply optional mask for encoder.
Args:
xs (paddle.Tensor): padded input, (B, L, D), L for max length
mask (paddle.Tensor): mask for xs, (B, 1, L)
use_dynamic_chunk (bool): whether to use dynamic chunk or not
use_dynamic_left_chunk (bool): whether to use dynamic left chunk for
training.
decoding_chunk_size (int): decoding chunk size for dynamic chunk, it's
0: default for training, use random dynamic chunk.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
static_chunk_size (int): chunk size for static chunk training/decoding
if it's greater than 0, if use_dynamic_chunk is true,
this parameter will be ignored
num_decoding_left_chunks (int): number of left chunks, this is for decoding,
the chunk size is decoding_chunk_size.
>=0: use num_decoding_left_chunks
<0: use all left chunks
Returns:
paddle.Tensor: chunk mask of the input xs.
"""
# Whether to use chunk mask or not
if use_dynamic_chunk:
max_len = xs.shape[1]
if decoding_chunk_size < 0:
chunk_size = max_len
num_left_chunks = -1
elif decoding_chunk_size > 0:
chunk_size = decoding_chunk_size
num_left_chunks = num_decoding_left_chunks
else:
# chunk size is either [1, 25] or full context(max_len).
# Since we use 4 times subsampling and allow up to 1s(100 frames)
# delay, the maximum frame is 100 / 4 = 25.
chunk_size = int(paddle.randint(1, max_len, (1, )))
num_left_chunks = -1
if chunk_size > max_len // 2:
chunk_size = max_len
else:
chunk_size = chunk_size % 25 + 1
if use_dynamic_left_chunk:
max_left_chunks = (max_len - 1) // chunk_size
num_left_chunks = int(
paddle.randint(0, max_left_chunks, (1, )))
chunk_masks = subsequent_chunk_mask(xs.shape[1], chunk_size,
num_left_chunks) # (L, L)
chunk_masks = chunk_masks.unsqueeze(0) # (1, L, L)
chunk_masks = masks & chunk_masks # (B, L, L)
elif static_chunk_size > 0:
num_left_chunks = num_decoding_left_chunks
chunk_masks = subsequent_chunk_mask(xs.shape[1], static_chunk_size,
num_left_chunks) # (L, L)
chunk_masks = chunk_masks.unsqueeze(0) # (1, L, L)
chunk_masks = masks & chunk_masks # (B, L, L)
else:
chunk_masks = masks
return chunk_masks
def mask_finished_scores(score: paddle.Tensor,
flag: paddle.Tensor) -> paddle.Tensor:
"""
If a sequence is finished, we only allow one alive branch. This function
aims to give one branch a zero score and the rest -inf score.
Args:
score (paddle.Tensor): A real value array with shape
(batch_size * beam_size, beam_size).
flag (paddle.Tensor): A bool array with shape
(batch_size * beam_size, 1).
Returns:
paddle.Tensor: (batch_size * beam_size, beam_size).
Examples:
flag: tensor([[ True],
[False]])
score: tensor([[-0.3666, -0.6664, 0.6019],
[-1.1490, -0.2948, 0.7460]])
unfinished: tensor([[False, True, True],
[False, False, False]])
finished: tensor([[ True, False, False],
[False, False, False]])
return: tensor([[ 0.0000, -inf, -inf],
[-1.1490, -0.2948, 0.7460]])
"""
beam_size = score.shape[-1]
zero_mask = paddle.zeros_like(flag, dtype=paddle.bool)
if beam_size > 1:
unfinished = paddle.concat(
(zero_mask, flag.tile([1, beam_size - 1])), axis=1)
finished = paddle.concat(
(flag, zero_mask.tile([1, beam_size - 1])), axis=1)
else:
unfinished = zero_mask
finished = flag
# infs = paddle.ones_like(score) * -float('inf')
# score = paddle.where(unfinished, infs, score)
# score = paddle.where(finished, paddle.zeros_like(score), score)
score.masked_fill_(unfinished, -float('inf'))
score.masked_fill_(finished, 0)
return score
def mask_finished_preds(pred: paddle.Tensor, flag: paddle.Tensor,
eos: int) -> paddle.Tensor:
"""
If a sequence is finished, all of its branch should be <eos>
Args:
pred (paddle.Tensor): A int array with shape
(batch_size * beam_size, beam_size).
flag (paddle.Tensor): A bool array with shape
(batch_size * beam_size, 1).
Returns:
paddle.Tensor: (batch_size * beam_size).
"""
beam_size = pred.shape[-1]
finished = flag.tile([1, beam_size])
return pred.masked_fill_(finished, eos)
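To see how the pieces above compose, here is the combination that add_optional_chunk_mask performs in the static-chunk branch, written out by hand. A sketch, assuming the PR's deepspeech.modules.mask is importable; logical_and is used instead of `&`, matching the TODO notes above:

# Combining padding mask and chunk mask by hand (illustrative only).
import paddle
from deepspeech.modules.mask import make_non_pad_mask, subsequent_chunk_mask

lengths = paddle.to_tensor([4, 2])
pad_mask = make_non_pad_mask(lengths).unsqueeze(1)   # (B, 1, L)
chunk_mask = subsequent_chunk_mask(4, 2)             # (L, L), chunk_size=2
chunk_mask = chunk_mask.unsqueeze(0)                 # (1, L, L)
masks = pad_mask.logical_and(chunk_mask)             # (B, L, L)
print(masks.astype(paddle.int64).numpy())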

@ -0,0 +1,57 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Positionwise feed forward layer definition."""
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["PositionwiseFeedForward"]
class PositionwiseFeedForward(nn.Layer):
"""Positionwise feed forward layer."""
def __init__(self,
idim: int,
hidden_units: int,
dropout_rate: float,
activation: nn.Layer=nn.ReLU()):
"""Construct a PositionwiseFeedForward object.
The feed-forward layer is applied at each position of the sequence.
The output dim is the same as the input dim.
Args:
idim (int): Input dimension.
hidden_units (int): The number of hidden units.
dropout_rate (float): Dropout rate.
activation (paddle.nn.Layer): Activation function
"""
super().__init__()
self.w_1 = nn.Linear(idim, hidden_units)
self.activation = activation
self.dropout = nn.Dropout(dropout_rate)
self.w_2 = nn.Linear(hidden_units, idim)
def forward(self, xs: paddle.Tensor) -> paddle.Tensor:
"""Forward function.
Args:
xs: input tensor (B, Lmax, D)
Returns:
output tensor, (B, Lmax, D)
"""
return self.w_2(self.dropout(self.activation(self.w_1(xs))))
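"Position-wise" means the same two-layer MLP is applied independently at every time step, so only the feature dim changes. A short check of that shape contract (a sketch, assuming the PR's module path):

# PositionwiseFeedForward keeps batch/time dims untouched (sketch).
import paddle
from deepspeech.modules.positionwise_feed_forward import PositionwiseFeedForward

ff = PositionwiseFeedForward(idim=8, hidden_units=32, dropout_rate=0.0)
xs = paddle.randn([2, 5, 8])
ys = ff(xs)
print(ys.shape)   # [2, 5, 8]: only the feature dim is transformed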

@ -11,19 +11,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import logging
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
from deepspeech.modules.mask import sequence_mask
from deepspeech.modules.activation import brelu
from deepspeech.modules.mask import sequence_mask
from deepspeech.utils.log import Log
logger = logging.getLogger(__name__)
logger = Log(__name__).getlog()
__all__ = ['RNNStack']
@ -41,7 +40,7 @@ class RNNCell(nn.RNNCellBase):
"""
def __init__(self,
hidden_size,
hidden_size: int,
activation="tanh",
weight_ih_attr=None,
weight_hh_attr=None,
@ -108,8 +107,8 @@ class GRUCell(nn.RNNCellBase):
"""
def __init__(self,
input_size,
hidden_size,
input_size: int,
hidden_size: int,
weight_ih_attr=None,
weight_hh_attr=None,
bias_ih_attr=None,
@ -132,7 +131,6 @@ class GRUCell(nn.RNNCellBase):
self.input_size = input_size
self._gate_activation = F.sigmoid
self._activation = paddle.tanh
#self._activation = F.relu
def forward(self, inputs, states=None):
if states is None:
@ -171,8 +169,6 @@ class BiRNNWithBN(nn.Layer):
"""Bidirectonal simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
:param name: Name of the layer parameters.
:type name: string
:param size: Dimension of RNN cells.
:type size: int
:param share_weights: Whether to share input-hidden weights between
@ -182,7 +178,7 @@ class BiRNNWithBN(nn.Layer):
:rtype: Variable
"""
def __init__(self, i_size, h_size, share_weights):
def __init__(self, i_size: int, h_size: int, share_weights: bool):
super().__init__()
self.share_weights = share_weights
if self.share_weights:
@ -208,7 +204,7 @@ class BiRNNWithBN(nn.Layer):
self.bw_rnn = nn.RNN(
self.fw_cell, is_reverse=True, time_major=False) #[B, T, D]
def forward(self, x, x_len):
def forward(self, x: paddle.Tensor, x_len: paddle.Tensor):
# x, shape [B, T, D]
fw_x = self.fw_bn(self.fw_fc(x))
bw_x = self.bw_bn(self.bw_fc(x))
@ -234,7 +230,7 @@ class BiGRUWithBN(nn.Layer):
:rtype: Variable
"""
def __init__(self, i_size, h_size, act):
def __init__(self, i_size: int, h_size: int):
super().__init__()
hidden_size = h_size * 3
@ -281,23 +277,29 @@ class RNNStack(nn.Layer):
:rtype: Variable
"""
def __init__(self, i_size, h_size, num_stacks, use_gru, share_rnn_weights):
def __init__(self,
i_size: int,
h_size: int,
num_stacks: int,
use_gru: bool,
share_rnn_weights: bool):
super().__init__()
self.rnn_stacks = nn.LayerList()
rnn_stacks = []
for i in range(num_stacks):
if use_gru:
#default:GRU using tanh
self.rnn_stacks.append(
BiGRUWithBN(i_size=i_size, h_size=h_size, act="relu"))
rnn_stacks.append(BiGRUWithBN(i_size=i_size, h_size=h_size))
else:
self.rnn_stacks.append(
rnn_stacks.append(
BiRNNWithBN(
i_size=i_size,
h_size=h_size,
share_weights=share_rnn_weights))
i_size = h_size * 2
def forward(self, x, x_len):
self.rnn_stacks = nn.ModuleList(rnn_stacks)
def forward(self, x: paddle.Tensor, x_len: paddle.Tensor):
"""
x: shape [B, T, D]
x_len: shape [B]

@ -0,0 +1,239 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Subsampling layer definition."""
from typing import Tuple
import paddle
from paddle import nn
from deepspeech.modules.embedding import PositionalEncoding
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = [
"LinearNoSubsampling", "Conv2dSubsampling4", "Conv2dSubsampling6",
"Conv2dSubsampling8"
]
class BaseSubsampling(nn.Layer):
def __init__(self, pos_enc_class: nn.Layer=PositionalEncoding):
super().__init__()
self.pos_enc = pos_enc_class
# window size = (1 + right_context) + (chunk_size -1) * subsampling_rate
self.right_context = 0
# stride = subsampling_rate * chunk_size
self.subsampling_rate = 1
def position_encoding(self, offset: int, size: int) -> paddle.Tensor:
return self.pos_enc.position_encoding(offset, size)
class LinearNoSubsampling(BaseSubsampling):
"""Linear transform the input without subsampling."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an linear object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
pos_enc_class (PositionalEncoding): position encoding class
"""
super().__init__(pos_enc_class)
self.out = nn.Sequential(
nn.Linear(idim, odim),
nn.LayerNorm(odim, epsilon=1e-12),
nn.Dropout(dropout_rate), )
self.right_context = 0
self.subsampling_rate = 1
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Input x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: linear input tensor (#batch, time', odim),
where time' = time .
paddle.Tensor: positional encoding
paddle.Tensor: linear input mask (#batch, 1, time'),
where time' = time .
"""
x = self.out(x)
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask
class Conv2dSubsampling4(BaseSubsampling):
"""Convolutional 2D subsampling (to 1/4 length)."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an Conv2dSubsampling4 object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
"""
super().__init__(pos_enc_class)
self.conv = nn.Sequential(
nn.Conv2D(1, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 3, 2),
nn.ReLU(), )
self.out = nn.Sequential(
nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim))
self.subsampling_rate = 4
# The right context for every conv layer is computed by:
# (kernel_size - 1) / 2 * stride * frame_rate_of_this_layer
# 6 = (3 - 1) / 2 * 2 * 1 + (3 - 1) / 2 * 2 * 2
self.right_context = 6
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Subsample x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: Subsampled tensor (#batch, time', odim),
where time' = time // 4.
paddle.Tensor: positional encoding
paddle.Tensor: Subsampled mask (#batch, 1, time'),
where time' = time // 4.
"""
x = x.unsqueeze(1) # (b, c=1, t, f)
x = self.conv(x)
b, c, t, f = paddle.shape(x)
x = self.out(x.transpose([0, 2, 1, 3]).reshape([b, t, c * f]))
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-2:2]
class Conv2dSubsampling6(BaseSubsampling):
"""Convolutional 2D subsampling (to 1/6 length)."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an Conv2dSubsampling6 object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
pos_enc_class (PositionalEncoding): Custom position encoding class.
"""
super().__init__(pos_enc_class)
self.conv = nn.Sequential(
nn.Conv2D(1, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 5, 3),
nn.ReLU(), )
# O = (I - F + Pstart + Pend) // S + 1
# when padding == 0: O = (I - F) // S + 1
self.linear = nn.Linear(odim * (((idim - 1) // 2 - 2) // 3), odim)
# The right context for every conv layer is computed by:
# (kernel_size - 1) / 2 * stride * frame_rate_of_this_layer
# 14 = (3 - 1) / 2 * 2 * 1 + (5 - 1) / 2 * 3 * 2
self.subsampling_rate = 6
self.right_context = 14
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Subsample x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: Subsampled tensor (#batch, time', odim),
where time' = time // 6.
paddle.Tensor: positional encoding
paddle.Tensor: Subsampled mask (#batch, 1, time'),
where time' = time // 6.
"""
x = x.unsqueeze(1) # (b, c, t, f)
x = self.conv(x)
b, c, t, f = paddle.shape(x)
x = self.linear(x.transpose([0, 2, 1, 3]).reshape([b, t, c * f]))
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-4:3]
class Conv2dSubsampling8(BaseSubsampling):
"""Convolutional 2D subsampling (to 1/8 length)."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an Conv2dSubsampling8 object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
"""
super().__init__(pos_enc_class)
self.conv = nn.Sequential(
nn.Conv2D(1, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 3, 2),
nn.ReLU(), )
self.linear = nn.Linear(odim * ((((idim - 1) // 2 - 1) // 2 - 1) // 2),
odim)
self.subsampling_rate = 8
# The right context for every conv layer is computed by:
# (kernel_size - 1) / 2 * stride * frame_rate_of_this_layer
# 14 = (3 - 1) / 2 * 2 * 1 + (3 - 1) / 2 * 2 * 2 + (3 - 1) / 2 * 2 * 4
self.right_context = 14
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Subsample x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: Subsampled tensor (#batch, time', odim),
where time' = time // 8.
paddle.Tensor: positional encoding
paddle.Tensor: Subsampled mask (#batch, 1, time'),
where time' = time // 8.
"""
x = x.unsqueeze(1) # (b, c=1, t, f)
x = self.conv(x)
b, c, t, f = paddle.shape(x)
x = self.linear(x.transpose([0, 2, 1, 3]).reshape([b, t, c * f]))
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-2:2][:, :, :-2:2]
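# Shape check (values assumed): with time=100, the three stride-2 convs give
# 49 -> 24 -> 11 frames, matching (((100 - 1) // 2 - 1) // 2 - 1) // 2 = 11;
# chaining [:, :, :-2:2] three times keeps 11 mask positions as well.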

@ -11,5 +11,3 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from deepspeech.training.trainer import *

@ -11,7 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
@ -57,13 +56,19 @@ def default_argument_parser():
# save jit model to
parser.add_argument("--export_path", type=str, help="path of the jit model to save")
# save asr result to
parser.add_argument("--result_file", type=str, help="path of save the asr result")
# running
parser.add_argument("--device", type=str, default='gpu', choices=["cpu", "gpu"], help="device type to use, cpu and gpu are supported.")
parser.add_argument("--device", type=str, default='gpu', choices=["cpu", "gpu"],
help="device type to use, cpu and gpu are supported.")
parser.add_argument("--nprocs", type=int, default=1, help="number of parallel processes to use.")
# overwrite extra config and default config
#parser.add_argument("--opts", nargs=argparse.REMAINDER, help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
parser.add_argument("--opts", type=str, default=[], nargs='+', help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
# parser.add_argument("--opts", nargs=argparse.REMAINDER,
# help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
parser.add_argument("--opts", type=str, default=[], nargs='+',
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
# yapf: enable
return parser
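# A hedged usage sketch (the KEY VALUE pair below is a made-up example; the
# merge into a yacs-style config is an assumption about the calling code):
#   parser = default_argument_parser()
#   args = parser.parse_args(
#       ["--device", "gpu", "--nprocs", "1",
#        "--opts", "training.n_epoch", "120"])
#   # args.opts == ["training.n_epoch", "120"], suitable for
#   # config.merge_from_list(args.opts)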

@ -11,18 +11,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import paddle
from paddle.fluid.dygraph import base as imperative_base
from paddle.fluid import layers
from paddle.fluid import core
from paddle.fluid import layers
from paddle.fluid.dygraph import base as imperative_base
logger = logging.getLogger(__name__)
from deepspeech.utils.log import Log
__all__ = ["ClipGradByGlobalNormWithLog"]
class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
logger = Log(__name__).getlog()
class ClipGradByGlobalNormWithLog(paddle.nn.ClipGradByGlobalNorm):
def __init__(self, clip_norm):
super().__init__(clip_norm)
@ -41,11 +42,11 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
square = layers.square(merge_grad)
sum_square = layers.reduce_sum(square)
logger.info(
f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
)
sum_square_list.append(sum_square)
# debug log
# logger.debug(f"Grad Before Clip: {p.name}: {float(sum_square.sqrt()) }")
# all parameters have been filtered out
if len(sum_square_list) == 0:
return params_grads
@ -53,7 +54,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
global_norm_var = layers.concat(sum_square_list)
global_norm_var = layers.reduce_sum(global_norm_var)
global_norm_var = layers.sqrt(global_norm_var)
logger.info(f"Grad Global Norm: {float(global_norm_var)}!!!!")
# debug log
logger.debug(f"Grad Global Norm: {float(global_norm_var)}!!!!")
max_global_norm = layers.fill_constant(
shape=[1], dtype=global_norm_var.dtype, value=self.clip_norm)
clip_var = layers.elementwise_div(
@ -66,9 +69,11 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
params_and_grads.append((p, g))
continue
new_grad = layers.elementwise_mul(x=g, y=clip_var)
logger.info(
f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
)
params_and_grads.append((p, new_grad))
# debug log
# logger.debug(
# f"Grad After Clip: {p.name}: {float(merge_grad.square().sum().sqrt())}"
# )
return params_and_grads
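# A minimal usage sketch (the clip value and learning rate are assumed;
# `model` stands for any nn.Layer):
#   clip = ClipGradByGlobalNormWithLog(clip_norm=5.0)
#   optimizer = paddle.optimizer.Adam(
#       learning_rate=0.001, parameters=model.parameters(), grad_clip=clip)
#   # each optimizer.step() then logs the global grad norm at debug level and
#   # rescales gradients whenever that norm exceeds 5.0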

@ -0,0 +1,66 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Union
from paddle.optimizer.lr import LRScheduler
from typeguard import check_argument_types
from deepspeech.utils.log import Log
__all__ = ["WarmupLR"]
logger = Log(__name__).getlog()
class WarmupLR(LRScheduler):
"""The WarmupLR scheduler
This scheduler is almost the same as the NoamLR scheduler, except for the
following difference:
NoamLR:
lr = optimizer.lr * model_size ** -0.5
* min(step ** -0.5, step * warmup_step ** -1.5)
WarmupLR:
lr = optimizer.lr * warmup_step ** 0.5
* min(step ** -0.5, step * warmup_step ** -1.5)
Note that the maximum lr equals optimizer.lr in this scheduler.
"""
def __init__(self,
warmup_steps: Union[int, float]=25000,
learning_rate=1.0,
last_epoch=-1,
verbose=False):
assert check_argument_types()
self.warmup_steps = warmup_steps
super().__init__(learning_rate, last_epoch, verbose)
def __repr__(self):
return f"{self.__class__.__name__}(warmup_steps={self.warmup_steps})"
def get_lr(self):
step_num = self.last_epoch + 1
return self.base_lr * self.warmup_steps**0.5 * min(
step_num**-0.5, step_num * self.warmup_steps**-1.5)
def set_step(self, step: int=None):
'''
It will update the learning rate in the optimizer according to the current ``step``.
The new learning rate will take effect on the next ``optimizer.step()``.
Args:
step (int, None): the current step. Default: None, which auto-increments from last_epoch=-1.
Returns:
None
'''
self.step(epoch=step)
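# A usage sketch with one worked value (learning_rate=0.002 is illustrative):
#   scheduler = WarmupLR(warmup_steps=25000, learning_rate=0.002)
#   optimizer = paddle.optimizer.Adam(
#       learning_rate=scheduler, parameters=model.parameters())
#   # lr rises linearly, peaking at step 25000 with
#   # 0.002 * 25000**0.5 * min(25000**-0.5, 25000 * 25000**-1.5) = 0.002,
#   # then decays as step**-0.5; call scheduler.step() once per batch.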
