E2E/Streaming Transformer/Conformer ASR (#578)

* add cmvn and label smoothing loss layer

* add layer for transformer

* add glu and conformer conv

* add torch compatible hack, mask funcs

* do not hack size since it already exists

* add test; attention

* add attention, common utils, hack paddle

* add audio utils

* conformer batch padding mask bug fix #223

* fix typo; python infer: fix rnn mem opt name error and batchnorm1d, will be available in 2.0.2

* fix ci

* fix ci

* add encoder

* refactor egs

* add decoder

* refactor ctc, add ctc align, refactor ckpt, add warmup lr scheduler, cmvn utils

* refactor docs

* add fix

* fix readme

* fix bugs, refactor collator, add pad_sequence, fix ckpt bugs

* fix docstring

* refactor data feed order

* add u2 model

* refactor cmvn, test

* add utils

* add u2 config

* fix bugs

* fix bugs

* fix autograd, which may have problems when using in-place operations

* refactor data, build vocab; add format data

* fix text featurizer

* refactor build vocab

* add fbank, refactor feature of speech

* refactor audio feat

* refactor data preparation

* refactor data

* model init from config

* add u2 bins

* flake8

* can train

* fix bugs, add coverage, add scripts

* test can run

* fix data

* speed perturb with sox

* add spec aug

* fix for train

* fix train logic

* fix logger

* log valid loss, time dataset process

* using np for speed perturb, remove some debug log of grad clip

* fix logger

* fix build vocab

* fix logger name

* using module logger as default

* fix

* fix install

* reorder imports

* fix board logger

* fix logger

* kaldi fbank and mfcc

* fix cmvn and print params

* fix add_eos_sos and cmvn

* fix cmvn compute

* fix logger and cmvn

* fix subsampling, label smoothing loss, remove useless

* add notebook test

* fix log

* fix tb logger

* multi gpu valid

* fix log

* fix log

* fix config

* fix compute cmvn, need paddle 2.1

* add cmvn notebook

* fix layer tools

* fix compute cmvn

* add rtf

* fix decoding

* fix layer tools

* fix log, add avg script

* more avg and test info

* fix dataset pickle problem; use paddle 2.1; num_workers can be > 0; save ckpt in exp dir; fix setup.sh

* add vimrc

* refactor tiny script, add transformer and stream conf

* spm demo; librispeech scripts and confs

* fix log

* add librispeech scripts

* refactor data pipe; fix conf; fix u2 default params

* fix bugs

* refactor aishell scripts

* fix test

* fix cmvn

* fix s0 scripts

* fix ds2 scripts and bugs

* fix dev & test dataset filter

* fix dataset filter

* filter dev

* fix ckpt path

* filter test set, since librispeech will cause OOM; all test WER will be worse due to train/test mismatch

* add comment

* add syllable doc

* fix ds2 configs

* add doc

* add pypinyin tools

* fix decoder using blank_id=0

* mmseg with pybind11

* format code
Hui Zhang 3 years ago committed by GitHub
parent 3a2de9e461
commit 71e046b0ba

@ -16,8 +16,8 @@
--- ---
Language: Cpp Language: Cpp
BasedOnStyle: Google BasedOnStyle: Google
IndentWidth: 2 IndentWidth: 4
TabWidth: 2 TabWidth: 4
ContinuationIndentWidth: 4 ContinuationIndentWidth: 4
MaxEmptyLinesToKeep: 2 MaxEmptyLinesToKeep: 2
AccessModifierOffset: -2 # The private/protected/public has no indent in class AccessModifierOffset: -2 # The private/protected/public has no indent in class

@ -0,0 +1,50 @@
[flake8]
########## OPTIONS ##########
# Set the maximum length that any line (with some exceptions) may be.
max-line-length = 120
################### FILE PATTERNS ##########################
# Provide a comma-separated list of glob patterns to exclude from checks.
exclude =
# git folder
.git,
# python cache
__pycache__,
third_party/,
# Provide a comma-separated list of glob patterns to include for checks.
filename =
*.py
########## RULES ##########
# ERROR CODES
#
# E/W - PEP8 errors/warnings (pycodestyle)
# F - linting errors (pyflakes)
# C - McCabe complexity error (mccabe)
#
# W503 - line break before binary operator
# Specify a list of codes to ignore.
ignore =
W503
E252,E262,E127,E265,E126,E266,E241,E261,E128,E125
W291,W293,W605
E203,E305,E402,E501,E721,E741,F403,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
# these ignores are from flake8-bugbear; please fix!
B007,B008,
# these ignores are from flake8-comprehensions; please fix!
C400,C401,C402,C403,C404,C405,C407,C411,C413,C414,C415
# Specify the list of error codes you wish Flake8 to report.
select =
E,
W,
F,
C

@ -0,0 +1,48 @@
[alias]
st = status
ci = commit
br = branch
co = checkout
df = diff
l = log --pretty=format:\"%h %ad | %s%d [%an]\" --graph --date=short
ll = log --stat
[merge]
tool = vimdiff
[core]
excludesfile = ~/.gitignore
editor = vim
[color]
branch = auto
diff = auto
status = auto
[color "branch"]
current = yellow reverse
local = yellow
remote = green
[color "diff"]
meta = yellow bold
frag = magenta bold
old = red bold
new = green bold
[color "status"]
added = yellow
changed = green
untracked = cyan
[push]
default = matching
[credential]
helper = store
[user]
name =
email =

.gitignore (vendored): 5 lines changed

@ -5,3 +5,8 @@ tools/venv
*.log *.log
*.pdmodel *.pdmodel
*.pdiparams* *.pdiparams*
*.zip
*.tar
*.tar.gz
.ipynb_checkpoints
*.npz

@ -0,0 +1,605 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "academic-surname",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddle import nn"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fundamental-treasure",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspace/DeepSpeech-2.x/tools/venv-dev/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
}
],
"source": [
"L = nn.Linear(256, 2048)\n",
"L2 = nn.Linear(2048, 256)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "consolidated-elephant",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import torch\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "moderate-noise",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float64\n",
"Tensor(shape=[2, 51, 256], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[[-1.54171216, -2.61531472, -1.79881978, ..., -0.31395876, 0.56513089, -0.44516513],\n",
" [-0.79492962, 1.91157901, 0.66567147, ..., 0.54825783, -1.01471853, -0.84924090],\n",
" [-1.22556651, -0.36225814, 0.65063190, ..., 0.65726501, 0.05563191, 0.09009409],\n",
" ...,\n",
" [ 0.38615900, -0.77905393, 0.99732304, ..., -1.38463700, -3.32365036, -1.31089687],\n",
" [ 0.05579993, 0.06885809, -1.66662002, ..., -0.23346378, -3.29372883, 1.30561364],\n",
" [ 1.90676069, 1.95093191, -0.28849599, ..., -0.06860496, 0.95347673, 1.00475824]],\n",
"\n",
" [[-0.91453546, 0.55298805, -1.06146812, ..., -0.86378336, 1.00454640, 1.26062179],\n",
" [ 0.10223761, 0.81301165, 2.36865163, ..., 0.16821407, 0.29240361, 1.05408621],\n",
" [-1.33196676, 1.94433689, 0.01934209, ..., 0.48036841, 0.51585966, 1.22893548],\n",
" ...,\n",
" [-0.19558455, -0.47075930, 0.90796155, ..., -1.28598249, -0.24321797, 0.17734711],\n",
" [ 0.89819717, -1.39516675, 0.17138045, ..., 2.39761519, 1.76364994, -0.52177650],\n",
" [ 0.94122332, -0.18581429, 1.36099780, ..., 0.67647684, -0.04699665, 1.51205540]]])\n",
"tensor([[[-1.5417, -2.6153, -1.7988, ..., -0.3140, 0.5651, -0.4452],\n",
" [-0.7949, 1.9116, 0.6657, ..., 0.5483, -1.0147, -0.8492],\n",
" [-1.2256, -0.3623, 0.6506, ..., 0.6573, 0.0556, 0.0901],\n",
" ...,\n",
" [ 0.3862, -0.7791, 0.9973, ..., -1.3846, -3.3237, -1.3109],\n",
" [ 0.0558, 0.0689, -1.6666, ..., -0.2335, -3.2937, 1.3056],\n",
" [ 1.9068, 1.9509, -0.2885, ..., -0.0686, 0.9535, 1.0048]],\n",
"\n",
" [[-0.9145, 0.5530, -1.0615, ..., -0.8638, 1.0045, 1.2606],\n",
" [ 0.1022, 0.8130, 2.3687, ..., 0.1682, 0.2924, 1.0541],\n",
" [-1.3320, 1.9443, 0.0193, ..., 0.4804, 0.5159, 1.2289],\n",
" ...,\n",
" [-0.1956, -0.4708, 0.9080, ..., -1.2860, -0.2432, 0.1773],\n",
" [ 0.8982, -1.3952, 0.1714, ..., 2.3976, 1.7636, -0.5218],\n",
" [ 0.9412, -0.1858, 1.3610, ..., 0.6765, -0.0470, 1.5121]]])\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspace/DeepSpeech-2.x/tools/venv-dev/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
}
],
"source": [
"x = np.random.randn(2, 51, 256)\n",
"print(x.dtype)\n",
"px = paddle.to_tensor(x, dtype='float32')\n",
"tx = torch.tensor(x, dtype=torch.float32)\n",
"print(px)\n",
"print(tx)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cooked-progressive",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 5,
"id": "mechanical-prisoner",
"metadata": {},
"outputs": [],
"source": [
"data = np.load('enc_0_ff_out.npz', allow_pickle=True)\n",
"t_norm_ff = data['norm_ff']\n",
"t_ff_out = data['ff_out']\n",
"t_ff_l_x = data['ff_l_x']\n",
"t_ff_l_a_x = data['ff_l_a_x']\n",
"t_ff_l_a_l_x = data['ff_l_a_l_x']\n",
"t_ps = data['ps']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "indie-marriage",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 6,
"id": "assured-zambia",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"True\n",
"True\n"
]
}
],
"source": [
"L.set_state_dict({'weight': t_ps[0].T, 'bias': t_ps[1]})\n",
"L2.set_state_dict({'weight': t_ps[2].T, 'bias': t_ps[3]})\n",
"\n",
"ps = []\n",
"for n, p in L.named_parameters():\n",
" ps.append(p)\n",
"\n",
"for n, p in L2.state_dict().items():\n",
" ps.append(p)\n",
" \n",
"for p, tp in zip(ps, t_ps):\n",
" print(np.allclose(p.numpy(), tp.T))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "committed-jacob",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "extreme-traffic",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "optimum-milwaukee",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 7,
"id": "viral-indian",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"True\n",
"True\n"
]
}
],
"source": [
"# data = np.load('enc_0_ff_out.npz', allow_pickle=True)\n",
"# t_norm_ff = data['norm_ff']\n",
"# t_ff_out = data['ff_out']\n",
"# t_ff_l_x = data['ff_l_x']\n",
"# t_ff_l_a_x = data['ff_l_a_x']\n",
"# t_ff_l_a_l_x = data['ff_l_a_l_x']\n",
"# t_ps = data['ps']\n",
"TL = torch.nn.Linear(256, 2048)\n",
"TL2 = torch.nn.Linear(2048, 256)\n",
"TL.load_state_dict({'weight': torch.tensor(t_ps[0]), 'bias': torch.tensor(t_ps[1])})\n",
"TL2.load_state_dict({'weight': torch.tensor(t_ps[2]), 'bias': torch.tensor(t_ps[3])})\n",
"\n",
"# for n, p in TL.named_parameters():\n",
"# print(n, p)\n",
"# for n, p in TL2.named_parameters():\n",
"# print(n, p)\n",
"\n",
"ps = []\n",
"for n, p in TL.state_dict().items():\n",
" ps.append(p.data.numpy())\n",
" \n",
"for n, p in TL2.state_dict().items():\n",
" ps.append(p.data.numpy())\n",
" \n",
"for p, tp in zip(ps, t_ps):\n",
" print(np.allclose(p, tp))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "skilled-vietnamese",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[[ 0.67277956 0.08313607 -0.62761104 ... -0.17480263 0.42718208\n",
" -0.5787626 ]\n",
" [ 0.91516656 0.5393416 1.7159258 ... 0.06144593 0.06486575\n",
" -0.03350811]\n",
" [ 0.438351 0.6227843 0.24096036 ... 1.0912522 -0.90929437\n",
" -1.012989 ]\n",
" ...\n",
" [ 0.68631977 0.14240924 0.10763275 ... -0.11513516 0.48065388\n",
" 0.04070369]\n",
" [-0.9525228 0.23197874 0.31264272 ... 0.5312439 0.18773697\n",
" -0.8450228 ]\n",
" [ 0.42024016 -0.04561988 0.54541194 ... -0.41933843 -0.00436018\n",
" -0.06663495]]\n",
"\n",
" [[-0.11638781 -0.33566502 -0.20887226 ... 0.17423287 -0.9195841\n",
" -0.8161046 ]\n",
" [-0.3469874 0.88269687 -0.11887559 ... -0.15566081 0.16357468\n",
" -0.20766167]\n",
" [-0.3847657 0.3984318 -0.06963477 ... -0.00360622 1.2360432\n",
" -0.26811332]\n",
" ...\n",
" [ 0.08230796 -0.46158582 0.54582864 ... 0.15747628 -0.44790155\n",
" 0.06020184]\n",
" [-0.8095085 0.43163058 -0.42837143 ... 0.8627463 0.90656304\n",
" 0.15847842]\n",
" [-1.485811 -0.18216592 -0.8882585 ... 0.32596245 0.7822631\n",
" -0.6460344 ]]]\n",
"[[[ 0.67278004 0.08313602 -0.6276114 ... -0.17480245 0.42718196\n",
" -0.5787625 ]\n",
" [ 0.91516703 0.5393413 1.7159253 ... 0.06144581 0.06486579\n",
" -0.03350812]\n",
" [ 0.43835106 0.62278455 0.24096027 ... 1.0912521 -0.9092943\n",
" -1.0129892 ]\n",
" ...\n",
" [ 0.6863195 0.14240888 0.10763284 ... -0.11513527 0.48065376\n",
" 0.04070365]\n",
" [-0.9525231 0.23197863 0.31264275 ... 0.53124386 0.18773702\n",
" -0.84502304]\n",
" [ 0.42024007 -0.04561983 0.545412 ... -0.41933888 -0.00436005\n",
" -0.066635 ]]\n",
"\n",
" [[-0.11638767 -0.33566508 -0.20887226 ... 0.17423296 -0.9195838\n",
" -0.8161046 ]\n",
" [-0.34698725 0.88269705 -0.11887549 ... -0.15566081 0.16357464\n",
" -0.20766166]\n",
" [-0.3847657 0.3984319 -0.06963488 ... -0.00360619 1.2360426\n",
" -0.26811326]\n",
" ...\n",
" [ 0.08230786 -0.4615857 0.5458287 ... 0.15747619 -0.44790167\n",
" 0.06020182]\n",
" [-0.8095083 0.4316307 -0.42837155 ... 0.862746 0.9065631\n",
" 0.15847899]\n",
" [-1.485811 -0.18216613 -0.8882584 ... 0.32596254 0.7822631\n",
" -0.6460344 ]]]\n",
"True\n",
"False\n"
]
}
],
"source": [
"y = L(px)\n",
"print(y.numpy())\n",
"\n",
"ty = TL(tx)\n",
"print(ty.data.numpy())\n",
"print(np.allclose(px.numpy(), tx.detach().numpy()))\n",
"print(np.allclose(y.numpy(), ty.detach().numpy()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "incorrect-allah",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "prostate-cameroon",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 9,
"id": "governmental-surge",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.04476918 0.554463 -0.3027508 ... -0.49600336 0.3751858\n",
" 0.8254095 ]\n",
" [ 0.95594174 -0.29528382 -1.2899452 ... 0.43718258 0.05584608\n",
" -0.06974669]]\n",
"[[ 0.04476918 0.5544631 -0.3027507 ... -0.49600336 0.37518573\n",
" 0.8254096 ]\n",
" [ 0.95594174 -0.29528376 -1.2899454 ... 0.4371827 0.05584623\n",
" -0.0697467 ]]\n",
"True\n",
"False\n",
"True\n"
]
}
],
"source": [
"x = np.random.randn(2, 256)\n",
"px = paddle.to_tensor(x, dtype='float32')\n",
"tx = torch.tensor(x, dtype=torch.float32)\n",
"y = L(px)\n",
"print(y.numpy())\n",
"ty = TL(tx)\n",
"print(ty.data.numpy())\n",
"print(np.allclose(px.numpy(), tx.detach().numpy()))\n",
"print(np.allclose(y.numpy(), ty.detach().numpy()))\n",
"print(np.allclose(y.numpy(), ty.detach().numpy(), atol=1e-5))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "confidential-jacket",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 10,
"id": "improved-civilization",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5e7e7c9fde8350084abf1898cf52651cfc84b17a\n"
]
}
],
"source": [
"print(paddle.version.commit)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d1e2d3b4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['__builtins__',\n",
" '__cached__',\n",
" '__doc__',\n",
" '__file__',\n",
" '__loader__',\n",
" '__name__',\n",
" '__package__',\n",
" '__spec__',\n",
" 'commit',\n",
" 'full_version',\n",
" 'istaged',\n",
" 'major',\n",
" 'minor',\n",
" 'mkl',\n",
" 'patch',\n",
" 'rc',\n",
" 'show',\n",
" 'with_mkl']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dir(paddle.version)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "c880c719",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.1.0\n"
]
}
],
"source": [
"print(paddle.version.full_version)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f26977bf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"commit: 5e7e7c9fde8350084abf1898cf52651cfc84b17a\n",
"None\n"
]
}
],
"source": [
"print(paddle.version.show())"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "04ad47f6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.6.0\n"
]
}
],
"source": [
"print(torch.__version__)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e1e03830",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['__builtins__',\n",
" '__cached__',\n",
" '__doc__',\n",
" '__file__',\n",
" '__loader__',\n",
" '__name__',\n",
" '__package__',\n",
" '__spec__',\n",
" '__version__',\n",
" 'cuda',\n",
" 'debug',\n",
" 'git_version',\n",
" 'hip']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dir(torch.version)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "4ad0389b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'b31f58de6fa8bbda5353b3c77d9be4914399724d'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.version.git_version"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "7870ea10",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'10.2'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.version.cuda"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db8ee5a7",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "6321ec2a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
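
The notebook above checks that a paddle.nn.Linear loaded from torch-exported weights reproduces the torch output. Note that paddle stores the Linear weight as [in_features, out_features] while torch uses [out_features, in_features], hence the transpose before set_state_dict, and that the float32 outputs of the two frameworks only match up to a small tolerance. A minimal standalone sketch of the same check (hypothetical shapes, assuming both paddle and torch are installed):

import numpy as np
import paddle
import torch

# torch-layout weight [out_features, in_features] and bias, standing in for a torch-exported checkpoint
w = np.random.randn(2048, 256).astype('float32')
b = np.random.randn(2048).astype('float32')

tl = torch.nn.Linear(256, 2048)
tl.load_state_dict({'weight': torch.tensor(w), 'bias': torch.tensor(b)})

pl = paddle.nn.Linear(256, 2048)
pl.set_state_dict({'weight': w.T, 'bias': b})  # paddle stores weight as [in_features, out_features]

x = np.random.randn(4, 256).astype('float32')
py = pl(paddle.to_tensor(x)).numpy()
ty = tl(torch.tensor(x)).detach().numpy()
print(np.allclose(py, ty, atol=1e-5))  # bit-exact equality is not expected in float32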

File diff suppressed because one or more lines are too long

@ -338,7 +338,7 @@
} }
], ],
"source": [ "source": [
"for idx, (audio, text, audio_len, text_len) in enumerate(batch_reader()):\n", "for idx, (audio, audio_len, text, text_len) in enumerate(batch_reader()):\n",
" print('test', text)\n", " print('test', text)\n",
" print(\"test raw\", ''.join( chr(i) for i in text[0][:int(text_len[0])] ))\n", " print(\"test raw\", ''.join( chr(i) for i in text[0][:int(text_len[0])] ))\n",
" print(\"test raw\", ''.join( chr(i) for i in text[-1][:int(text_len[-1])] ))\n", " print(\"test raw\", ''.join( chr(i) for i in text[-1][:int(text_len[-1])] ))\n",
@ -386,4 +386,4 @@
}, },
"nbformat": 4, "nbformat": 4,
"nbformat_minor": 5 "nbformat_minor": 5
} }

File diff suppressed because it is too large

@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "breeding-haven",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x\n"
]
},
{
"data": {
"text/plain": [
"'/home/ssd5/zhanghui/DeepSpeech2.x'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%cd ..\n",
"%pwd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "appropriate-theta",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LICENSE deepspeech examples\t\t requirements.txt tools\r\n",
"README.md docs\t libsndfile-1.0.28\t setup.sh\t utils\r\n",
"README_cn.md env.sh\t libsndfile-1.0.28.tar.gz tests\r\n"
]
}
],
"source": [
"!ls"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "entire-bloom",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
" def convert_to_list(value, n, name, dtype=np.int):\n",
"WARNING:root:override cat of paddle.Tensor if exists or register, remove this when fixed!\n",
"WARNING:root:register user masked_fill to paddle.Tensor, remove this when fixed!\n",
"WARNING:root:register user masked_fill_ to paddle.Tensor, remove this when fixed!\n",
"WARNING:root:register user repeat to paddle.Tensor, remove this when fixed!\n",
"WARNING:root:register user glu to paddle.nn.functional, remove this when fixed!\n",
"WARNING:root:register user GLU to paddle.nn, remove this when fixed!\n",
"WARNING:root:register user ConstantPad2d to paddle.nn, remove this when fixed!\n",
"WARNING:root:override ctc_loss of paddle.nn.functional if exists, remove this when fixed!\n"
]
}
],
"source": [
"from deepspeech.modules import loss"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "governmental-aircraft",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
}
],
"source": [
"import paddle"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "proprietary-disaster",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<function deepspeech.modules.repeat(xs: paddle.VarBase, *size: Any) -> paddle.VarBase>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paddle.Tensor.repeat"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "first-diagram",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<property at 0x7fb515eeeb88>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paddle.Tensor.size"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "intelligent-david",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<function paddle.tensor.manipulation.concat(x, axis=0, name=None)>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paddle.Tensor.cat"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "bronze-tenant",
"metadata": {},
"outputs": [],
"source": [
"a = paddle.to_tensor([12,32, 10, 12, 123,32 ,4])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "balanced-bearing",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.size"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "extreme-republic",
"metadata": {},
"outputs": [],
"source": [
"def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor:\n",
" nargs = len(args)\n",
" assert (nargs <= 1)\n",
" s = paddle.shape(xs)\n",
" if nargs == 1:\n",
" return s[args[0]]\n",
" else:\n",
" return s\n",
"\n",
"# logger.warn(\n",
"# \"override size of paddle.Tensor if exists or register, remove this when fixed!\"\n",
"# )\n",
"paddle.Tensor.size = size"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "gross-addiction",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [7])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.size(0)\n",
"a.size()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "adverse-dining",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [7])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "popular-potato",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
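
The notebook above works around paddle.Tensor.size being a property (which returns the element count) by rebinding it to a torch-style size() method, which is what the "hack paddle" commits refer to. A condensed sketch of the same monkey-patch, shown only as an illustration:

import paddle

def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
    # torch-like size(): no args -> full shape tensor, one arg -> that single dimension
    assert len(args) <= 1
    s = paddle.shape(xs)
    return s[args[0]] if len(args) == 1 else s

# rebind the built-in `size` property; remove once paddle provides a compatible method
paddle.Tensor.size = size

a = paddle.to_tensor([12, 32, 10, 12, 123, 32, 4])
print(a.size())   # shape tensor, here [7]
print(a.size(0))  # first dimension, here 7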

@ -0,0 +1,672 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x\n"
]
},
{
"data": {
"text/plain": [
"'/home/ssd5/zhanghui/DeepSpeech2.x'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%cd ..\n",
"%pwd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2021-03-26 02:55:23,873 - WARNING - register user softmax to paddle, remove this when fixed!\n",
"2021-03-26 02:55:23,875 - WARNING - register user sigmoid to paddle, remove this when fixed!\n",
"2021-03-26 02:55:23,875 - WARNING - register user relu to paddle, remove this when fixed!\n",
"2021-03-26 02:55:23,876 - WARNING - override cat of paddle if exists or register, remove this when fixed!\n",
"2021-03-26 02:55:23,876 - WARNING - override eq of paddle.Tensor if exists or register, remove this when fixed!\n",
"2021-03-26 02:55:23,877 - WARNING - override contiguous of paddle.Tensor if exists or register, remove this when fixed!\n",
"2021-03-26 02:55:23,877 - WARNING - override size of paddle.Tensor (`to_static` do not process `size` property, maybe some `paddle` api dependent on it), remove this when fixed!\n",
"2021-03-26 02:55:23,878 - WARNING - register user view to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,878 - WARNING - register user view_as to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,879 - WARNING - register user masked_fill to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,880 - WARNING - register user masked_fill_ to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,880 - WARNING - register user fill_ to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,881 - WARNING - register user repeat to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,881 - WARNING - register user softmax to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,882 - WARNING - register user sigmoid to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,882 - WARNING - register user relu to paddle.Tensor, remove this when fixed!\n",
"2021-03-26 02:55:23,883 - WARNING - register user glu to paddle.nn.functional, remove this when fixed!\n",
"2021-03-26 02:55:23,883 - WARNING - override ctc_loss of paddle.nn.functional if exists, remove this when fixed!\n",
"2021-03-26 02:55:23,884 - WARNING - register user GLU to paddle.nn, remove this when fixed!\n",
"2021-03-26 02:55:23,884 - WARNING - register user ConstantPad2d to paddle.nn, remove this when fixed!\n",
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv-dev/lib/python3.7/site-packages/scipy/fftpack/__init__.py:103: DeprecationWarning: The module numpy.dual is deprecated. Instead of using dual, use the functions directly from numpy or scipy.\n",
" from numpy.dual import register_func\n",
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv-dev/lib/python3.7/site-packages/scipy/special/orthogonal.py:81: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
" from numpy import (exp, inf, pi, sqrt, floor, sin, cos, around, int,\n"
]
}
],
"source": [
"import os\n",
"import time\n",
"import argparse\n",
"import functools\n",
"import paddle\n",
"import numpy as np\n",
"\n",
"from deepspeech.utils.socket_server import warm_up_test\n",
"from deepspeech.utils.socket_server import AsrTCPServer\n",
"from deepspeech.utils.socket_server import AsrRequestHandler\n",
"\n",
"from deepspeech.training.cli import default_argument_parser\n",
"from deepspeech.exps.deepspeech2.config import get_cfg_defaults\n",
"\n",
"from deepspeech.frontend.utility import read_manifest\n",
"from deepspeech.utils.utility import add_arguments, print_arguments\n",
"\n",
"from deepspeech.models.deepspeech2 import DeepSpeech2Model\n",
"from deepspeech.models.deepspeech2 import DeepSpeech2InferModel\n",
"from deepspeech.io.dataset import ManifestDataset\n",
"\n",
"\n",
"\n",
"from deepspeech.frontend.utility import read_manifest"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0.0\n",
"e7f28d6c0db54eb9c9a810612300b526687e56a6\n",
"OFF\n",
"OFF\n",
"commit: e7f28d6c0db54eb9c9a810612300b526687e56a6\n",
"None\n",
"0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ssd5/zhanghui/DeepSpeech2.x/tools/venv-dev/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
},
{
"data": {
"text/plain": [
"['__builtins__',\n",
" '__cached__',\n",
" '__doc__',\n",
" '__file__',\n",
" '__loader__',\n",
" '__name__',\n",
" '__package__',\n",
" '__spec__',\n",
" 'commit',\n",
" 'full_version',\n",
" 'istaged',\n",
" 'major',\n",
" 'minor',\n",
" 'mkl',\n",
" 'patch',\n",
" 'rc',\n",
" 'show',\n",
" 'with_mkl']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(paddle.__version__)\n",
"print(paddle.version.commit)\n",
"print(paddle.version.with_mkl)\n",
"print(paddle.version.mkl())\n",
"print(paddle.version.show())\n",
"print(paddle.version.patch)\n",
"dir(paddle.version)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"data:\n",
" augmentation_config: conf/augmentation.config\n",
" batch_size: 64\n",
" dev_manifest: data/manifest.dev\n",
" keep_transcription_text: False\n",
" max_duration: 27.0\n",
" max_freq: None\n",
" mean_std_filepath: examples/aishell/data/mean_std.npz\n",
" min_duration: 0.0\n",
" n_fft: None\n",
" num_workers: 0\n",
" random_seed: 0\n",
" shuffle_method: batch_shuffle\n",
" sortagrad: True\n",
" specgram_type: linear\n",
" stride_ms: 10.0\n",
" target_dB: -20\n",
" target_sample_rate: 16000\n",
" test_manifest: examples/aishell/data/manifest.test\n",
" train_manifest: data/manifest.train\n",
" use_dB_normalization: True\n",
" vocab_filepath: examples/aishell/data/vocab.txt\n",
" window_ms: 20.0\n",
"decoding:\n",
" alpha: 2.6\n",
" batch_size: 128\n",
" beam_size: 300\n",
" beta: 5.0\n",
" cutoff_prob: 0.99\n",
" cutoff_top_n: 40\n",
" decoding_method: ctc_beam_search\n",
" error_rate_type: cer\n",
" lang_model_path: data/lm/zh_giga.no_cna_cmn.prune01244.klm\n",
" num_proc_bsearch: 10\n",
"model:\n",
" num_conv_layers: 2\n",
" num_rnn_layers: 3\n",
" rnn_layer_size: 1024\n",
" share_rnn_weights: False\n",
" use_gru: True\n",
"training:\n",
" global_grad_clip: 5.0\n",
" lr: 0.0005\n",
" lr_decay: 0.83\n",
" n_epoch: 30\n",
" weight_decay: 1e-06\n",
"----------- Configuration Arguments -----------\n",
"checkpoint_path: examples/aishell/ckpt-loss2e-3-0.83-5/checkpoints/step-11725\n",
"config: examples/aishell/conf/deepspeech2.yaml\n",
"device: gpu\n",
"dump_config: None\n",
"export_path: None\n",
"host_ip: localhost\n",
"host_port: 8086\n",
"model_dir: None\n",
"model_file: examples/aishell/jit.model.pdmodel\n",
"nprocs: 1\n",
"opts: ['data.test_manifest', 'examples/aishell/data/manifest.test', 'data.mean_std_filepath', 'examples/aishell/data/mean_std.npz', 'data.vocab_filepath', 'examples/aishell/data/vocab.txt']\n",
"output: None\n",
"params_file: examples/aishell/jit.model.pdiparams\n",
"speech_save_dir: demo_cache\n",
"use_gpu: False\n",
"warmup_manifest: examples/aishell/data/manifest.test\n",
"------------------------------------------------\n"
]
}
],
"source": [
"parser = default_argument_parser()\n",
"add_arg = functools.partial(add_arguments, argparser=parser)\n",
"add_arg('host_ip', str,\n",
" 'localhost',\n",
" \"Server's IP address.\")\n",
"add_arg('host_port', int, 8086, \"Server's IP port.\")\n",
"add_arg('speech_save_dir', str,\n",
" 'demo_cache',\n",
" \"Directory to save demo audios.\")\n",
"add_arg('warmup_manifest', \n",
" str, \n",
" \"examples/aishell/data/manifest.test\", \n",
" \"Filepath of manifest to warm up.\")\n",
"add_arg(\n",
" \"--model_file\",\n",
" type=str,\n",
" default=\"examples/aishell/jit.model.pdmodel\",\n",
" help=\"Model filename, Specify this when your model is a combined model.\"\n",
")\n",
"add_arg(\n",
" \"--params_file\",\n",
" type=str,\n",
" default=\"examples/aishell/jit.model.pdiparams\",\n",
" help=\n",
" \"Parameter filename, Specify this when your model is a combined model.\"\n",
")\n",
"add_arg(\n",
" \"--model_dir\",\n",
" type=str,\n",
" default=None,\n",
" help=\n",
" \"Model dir, If you load a non-combined model, specify the directory of the model.\"\n",
")\n",
"add_arg(\"--use_gpu\",type=bool,default=False, help=\"Whether use gpu.\")\n",
"\n",
"\n",
"args = parser.parse_args(\n",
" \"--checkpoint_path examples/aishell/ckpt-loss2e-3-0.83-5/checkpoints/step-11725 --config examples/aishell/conf/deepspeech2.yaml --opts data.test_manifest examples/aishell/data/manifest.test data.mean_std_filepath examples/aishell/data/mean_std.npz data.vocab_filepath examples/aishell/data/vocab.txt\".split()\n",
")\n",
"\n",
"\n",
"config = get_cfg_defaults()\n",
"if args.config:\n",
" config.merge_from_file(args.config)\n",
"if args.opts:\n",
" config.merge_from_list(args.opts)\n",
"config.freeze()\n",
"print(config)\n",
"\n",
"args.warmup_manifest = config.data.test_manifest\n",
"\n",
"print_arguments(args)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dataset = ManifestDataset(\n",
" config.data.test_manifest,\n",
" config.data.unit_type,\n",
" config.data.vocab_filepath,\n",
" config.data.mean_std_filepath,\n",
" augmentation_config=\"{}\",\n",
" max_duration=config.data.max_duration,\n",
" min_duration=config.data.min_duration,\n",
" stride_ms=config.data.stride_ms,\n",
" window_ms=config.data.window_ms,\n",
" n_fft=config.data.n_fft,\n",
" max_freq=config.data.max_freq,\n",
" target_sample_rate=config.data.target_sample_rate,\n",
" specgram_type=config.data.specgram_type,\n",
" feat_dim=config.data.feat_dim,\n",
" delta_delta=config.data.delat_delta,\n",
" use_dB_normalization=config.data.use_dB_normalization,\n",
" target_dB=config.data.target_dB,\n",
" random_seed=config.data.random_seed,\n",
" keep_transcription_text=True)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2021-03-26 02:55:57,930 - INFO - [checkpoint] Rank 0: loaded model from examples/aishell/ckpt-loss2e-3-0.83-5/checkpoints/step-11725.pdparams\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"layer summary:\n",
"encoder.conv.conv_in.conv.weight|[32, 1, 41, 11]|14432\n",
"encoder.conv.conv_in.bn.weight|[32]|32\n",
"encoder.conv.conv_in.bn.bias|[32]|32\n",
"encoder.conv.conv_in.bn._mean|[32]|32\n",
"encoder.conv.conv_in.bn._variance|[32]|32\n",
"encoder.conv.conv_stack.0.conv.weight|[32, 32, 21, 11]|236544\n",
"encoder.conv.conv_stack.0.bn.weight|[32]|32\n",
"encoder.conv.conv_stack.0.bn.bias|[32]|32\n",
"encoder.conv.conv_stack.0.bn._mean|[32]|32\n",
"encoder.conv.conv_stack.0.bn._variance|[32]|32\n",
"encoder.rnn.rnn_stacks.0.fw_fc.weight|[1312, 3072]|4030464\n",
"encoder.rnn.rnn_stacks.0.fw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_fc.weight|[1312, 3072]|4030464\n",
"encoder.rnn.rnn_stacks.0.bw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.fw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.bw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.fw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.fw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.0.bw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.0.bw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.1.fw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.1.bw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.fw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.bw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.fw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.fw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.1.bw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.1.bw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.2.fw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_fc.weight|[2048, 3072]|6291456\n",
"encoder.rnn.rnn_stacks.2.bw_bn.weight|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_bn.bias|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_bn._mean|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_bn._variance|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.fw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.bw_cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.fw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.fw_rnn.cell.bias_hh|[3072]|3072\n",
"encoder.rnn.rnn_stacks.2.bw_rnn.cell.weight_hh|[3072, 1024]|3145728\n",
"encoder.rnn.rnn_stacks.2.bw_rnn.cell.bias_hh|[3072]|3072\n",
"decoder.ctc_lo.weight|[2048, 4300]|8806400\n",
"decoder.ctc_lo.bias|[4300]|4300\n",
"layer has 66 parameters, 80148012 elements.\n"
]
}
],
"source": [
"model = DeepSpeech2InferModel.from_pretrained(dataset, config,\n",
" args.checkpoint_path)\n",
"model.eval()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"examples/aishell/jit.model.pdmodel\n",
"examples/aishell/jit.model.pdiparams\n",
"0\n",
"False\n"
]
}
],
"source": [
"\n",
"from paddle.inference import Config\n",
"from paddle.inference import PrecisionType\n",
"from paddle.inference import create_predictor\n",
"\n",
"args.use_gpu=False\n",
"paddle.set_device('cpu')\n",
"\n",
"def init_predictor(args):\n",
" if args.model_dir is not None:\n",
" config = Config(args.model_dir)\n",
" else:\n",
" config = Config(args.model_file, args.params_file)\n",
"\n",
" if args.use_gpu:\n",
" config.enable_use_gpu(memory_pool_init_size_mb=1000, device_id=0)\n",
"# config.enable_tensorrt_engine(precision_mode=PrecisionType.Float32,\n",
"# use_calib_mode=True) # 开启TensorRT预测精度为fp32开启int8离线量化\n",
" else:\n",
" # If not specific mkldnn, you can set the blas thread.\n",
" # The thread num should not be greater than the number of cores in the CPU.\n",
" config.set_cpu_math_library_num_threads(1)\n",
" config.enable_mkldnn()\n",
" \n",
" config.enable_memory_optim()\n",
" config.switch_ir_optim(True)\n",
" \n",
" print(config.model_dir())\n",
" print(config.prog_file())\n",
" print(config.params_file())\n",
" print(config.gpu_device_id())\n",
" print(args.use_gpu)\n",
" predictor = create_predictor(config)\n",
" return predictor\n",
"\n",
"def run(predictor, audio, audio_len):\n",
" # copy img data to input tensor\n",
" input_names = predictor.get_input_names()\n",
" for i, name in enumerate(input_names):\n",
" print(\"input:\", i, name)\n",
" \n",
" audio_tensor = predictor.get_input_handle('audio')\n",
" audio_tensor.reshape(audio.shape)\n",
" audio_tensor.copy_from_cpu(audio.copy())\n",
" \n",
" audiolen_tensor = predictor.get_input_handle('audio_len')\n",
" audiolen_tensor.reshape(audio_len.shape)\n",
" audiolen_tensor.copy_from_cpu(audio_len.copy())\n",
"\n",
" output_names = predictor.get_output_names()\n",
" for i, name in enumerate(output_names):\n",
" print(\"output:\", i, name)\n",
"\n",
" # do the inference\n",
" predictor.run()\n",
"\n",
" results = []\n",
" # get out data from output tensor\n",
" output_names = predictor.get_output_names()\n",
" for i, name in enumerate(output_names):\n",
" output_tensor = predictor.get_output_handle(name)\n",
" output_data = output_tensor.copy_to_cpu()\n",
" results.append(output_data)\n",
"\n",
" return results\n",
"\n",
"\n",
"predictor = init_predictor(args)\n",
"\n",
"def file_to_transcript(filename):\n",
" print(filename)\n",
" feature = dataset.process_utterance(filename, \"\")\n",
" audio = np.array([feature[0]]).astype('float32') #[1, D, T]\n",
" audio_len = feature[0].shape[1]\n",
" audio_len = np.array([audio_len]).astype('int64') # [1]\n",
" \n",
" \n",
" i_probs = run(predictor, audio, audio_len)\n",
" print('jit:', i_probs[0], type(i_probs[0]))\n",
" \n",
" audio = paddle.to_tensor(audio)\n",
" audio_len = paddle.to_tensor(audio_len)\n",
" print(audio.shape)\n",
" print(audio_len.shape)\n",
" \n",
" #eouts, eouts_len = model.encoder(audio, audio_len)\n",
" #probs = model.decoder.softmax(eouts)\n",
" probs = model.forward(audio, audio_len)\n",
" print('paddle:', probs.numpy())\n",
" \n",
" flag = np.allclose(i_probs[0], probs.numpy())\n",
" print(flag)\n",
" \n",
" return probs\n",
"\n",
"# result_transcript = model.decode(\n",
"# audio,\n",
"# audio_len,\n",
"# vocab_list=dataset.vocab_list,\n",
"# decoding_method=config.decoding.decoding_method,\n",
"# lang_model_path=config.decoding.lang_model_path,\n",
"# beam_alpha=config.decoding.alpha,\n",
"# beam_beta=config.decoding.beta,\n",
"# beam_size=config.decoding.beam_size,\n",
"# cutoff_prob=config.decoding.cutoff_prob,\n",
"# cutoff_top_n=config.decoding.cutoff_top_n,\n",
"# num_processes=config.decoding.num_proc_bsearch)\n",
"# return result_transcript[0]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warm-up Test Case %d: %s 0 /home/ssd5/zhanghui/DeepSpeech2.x/examples/aishell/../dataset/aishell/data_aishell/wav/test/S0764/BAC009S0764W0124.wav\n",
"/home/ssd5/zhanghui/DeepSpeech2.x/examples/aishell/../dataset/aishell/data_aishell/wav/test/S0764/BAC009S0764W0124.wav\n",
"input: 0 audio\n",
"input: 1 audio_len\n",
"output: 0 tmp_75\n",
"jit: [[[8.91786298e-12 4.45648032e-12 3.67572750e-09 ... 8.91767563e-12\n",
" 8.91573707e-12 4.64317296e-08]\n",
" [1.55950222e-15 2.62794089e-14 4.50423509e-12 ... 1.55944271e-15\n",
" 1.55891342e-15 9.99992609e-01]\n",
" [1.24638127e-17 7.61802427e-16 2.93265812e-14 ... 1.24633371e-17\n",
" 1.24587264e-17 1.00000000e+00]\n",
" ...\n",
" [4.37488240e-15 2.43676260e-12 1.98770514e-12 ... 4.37479896e-15\n",
" 4.37354747e-15 1.00000000e+00]\n",
" [3.89334696e-13 1.66754856e-11 1.42900388e-11 ... 3.89329492e-13\n",
" 3.89252270e-13 1.00000000e+00]\n",
" [1.00349985e-10 2.56293708e-10 2.91177582e-10 ... 1.00347876e-10\n",
" 1.00334095e-10 9.99998808e-01]]] <class 'numpy.ndarray'>\n",
"[1, 161, 522]\n",
"[1]\n",
"paddle: [[[8.91789680e-12 4.45649724e-12 3.67574149e-09 ... 8.91770945e-12\n",
" 8.91577090e-12 4.64319072e-08]\n",
" [1.55950222e-15 2.62794089e-14 4.50423509e-12 ... 1.55944271e-15\n",
" 1.55891342e-15 9.99992609e-01]\n",
" [1.24638599e-17 7.61805339e-16 2.93267472e-14 ... 1.24633842e-17\n",
" 1.24587735e-17 1.00000000e+00]\n",
" ...\n",
" [4.37488240e-15 2.43676737e-12 1.98770514e-12 ... 4.37479896e-15\n",
" 4.37354747e-15 1.00000000e+00]\n",
" [3.89336187e-13 1.66755481e-11 1.42900925e-11 ... 3.89330983e-13\n",
" 3.89253761e-13 1.00000000e+00]\n",
" [1.00349985e-10 2.56293708e-10 2.91177582e-10 ... 1.00347876e-10\n",
" 1.00334095e-10 9.99998808e-01]]]\n",
"False\n"
]
}
],
"source": [
"manifest = read_manifest(args.warmup_manifest)\n",
"\n",
"for idx, sample in enumerate(manifest[:1]):\n",
" print(\"Warm-up Test Case %d: %s\", idx, sample['audio_filepath'])\n",
" start_time = time.time()\n",
" transcript = file_to_transcript(sample['audio_filepath'])\n",
" finish_time = time.time()\n",
"# print(\"Response Time: %f, Transcript: %s\" %\n",
"# (finish_time - start_time, transcript))\n",
" break"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 161, 522) (1,)\n",
"input: 0 audio\n",
"input: 1 audio_len\n",
"output: 0 tmp_75\n",
"jit: [[[8.91789680e-12 4.45649724e-12 3.67574149e-09 ... 8.91770945e-12\n",
" 8.91577090e-12 4.64319072e-08]\n",
" [1.55950222e-15 2.62794089e-14 4.50423509e-12 ... 1.55944271e-15\n",
" 1.55891342e-15 9.99992609e-01]\n",
" [1.24638599e-17 7.61805339e-16 2.93267472e-14 ... 1.24633842e-17\n",
" 1.24587735e-17 1.00000000e+00]\n",
" ...\n",
" [4.37488240e-15 2.43676737e-12 1.98770514e-12 ... 4.37479896e-15\n",
" 4.37354747e-15 1.00000000e+00]\n",
" [3.89336187e-13 1.66755481e-11 1.42900925e-11 ... 3.89330983e-13\n",
" 3.89253761e-13 1.00000000e+00]\n",
" [1.00349985e-10 2.56293708e-10 2.91177582e-10 ... 1.00347876e-10\n",
" 1.00334095e-10 9.99998808e-01]]]\n"
]
}
],
"source": [
"def test(filename):\n",
" feature = dataset.process_utterance(filename, \"\")\n",
" audio = np.array([feature[0]]).astype('float32') #[1, D, T]\n",
" audio_len = feature[0].shape[1]\n",
" audio_len = np.array([audio_len]).astype('int64') # [1]\n",
" \n",
" print(audio.shape, audio_len.shape)\n",
"\n",
" i_probs = run(predictor, audio, audio_len)\n",
" print('jit:', i_probs[0])\n",
" return i_probs\n",
" \n",
"probs = test(sample['audio_filepath'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
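
The notebook above runs the exported jit model through paddle.inference and compares its output probabilities against the eager DeepSpeech2InferModel: the printed values agree to several significant digits, yet the plain np.allclose check prints False. When validating an export like this, it is common to compare with explicit tolerances, or to compare the decoded transcripts instead of raw probabilities. A small illustrative helper (names follow the notebook; suitable rtol/atol depend on the model and backend, mkldnn here):

import numpy as np

def outputs_match(jit_out, eager_out, rtol=1e-5, atol=1e-8):
    # treat small numerical discrepancies between the exported graph and eager mode as equal;
    # the tolerances are placeholders, not values verified against this model
    return np.allclose(jit_out, eager_out, rtol=rtol, atol=atol)

# e.g. outputs_match(i_probs[0], probs.numpy()) with the arrays produced above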

@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 32,
"id": "academic-surname",
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"from paddle import nn"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "fundamental-treasure",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parameter containing:\n",
"Tensor(shape=[256], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
" [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])\n",
"Parameter containing:\n",
"Tensor(shape=[256], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
" [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])\n"
]
}
],
"source": [
"L = nn.LayerNorm(256, epsilon=1e-12)\n",
"for p in L.parameters():\n",
" print(p)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "consolidated-elephant",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "moderate-noise",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float64\n"
]
}
],
"source": [
"x = np.random.randn(2, 51, 256)\n",
"print(x.dtype)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "cooked-progressive",
"metadata": {},
"outputs": [],
"source": [
"y = L(paddle.to_tensor(x, dtype='float32'))"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "optimum-milwaukee",
"metadata": {},
"outputs": [],
"source": [
"import torch"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "viral-indian",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parameter containing:\n",
"tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
" 1., 1., 1., 1.], requires_grad=True)\n",
"Parameter containing:\n",
"tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
" 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
" requires_grad=True)\n"
]
}
],
"source": [
"TL = torch.nn.LayerNorm(256, eps=1e-12)\n",
"for p in TL.parameters():\n",
" print(p)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "skilled-vietnamese",
"metadata": {},
"outputs": [],
"source": [
"ty = TL(torch.tensor(x, dtype=torch.float32))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "incorrect-allah",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.allclose(y.numpy(), ty.detach().numpy())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "prostate-cameroon",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 52,
"id": "governmental-surge",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = np.random.randn(2, 256)\n",
"y = L(paddle.to_tensor(x, dtype='float32'))\n",
"ty = TL(torch.tensor(x, dtype=torch.float32))\n",
"np.allclose(y.numpy(), ty.detach().numpy())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "confidential-jacket",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,449 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "primary-organic",
"metadata": {},
"outputs": [],
"source": [
"import torch"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "stopped-semester",
"metadata": {},
"outputs": [],
"source": [
"def mask_finished_scores(score: torch.Tensor,\n",
" flag: torch.Tensor) -> torch.Tensor:\n",
" \"\"\"\n",
" If a sequence is finished, we only allow one alive branch. This function\n",
" aims to give one branch a zero score and the rest -inf score.\n",
" Args:\n",
" score (torch.Tensor): A real value array with shape\n",
" (batch_size * beam_size, beam_size).\n",
" flag (torch.Tensor): A bool array with shape\n",
" (batch_size * beam_size, 1).\n",
" Returns:\n",
" torch.Tensor: (batch_size * beam_size, beam_size).\n",
" \"\"\"\n",
" beam_size = score.size(-1)\n",
" zero_mask = torch.zeros_like(flag, dtype=torch.bool)\n",
" if beam_size > 1:\n",
" unfinished = torch.cat((zero_mask, flag.repeat([1, beam_size - 1])),\n",
" dim=1)\n",
" finished = torch.cat((flag, zero_mask.repeat([1, beam_size - 1])),\n",
" dim=1)\n",
" else:\n",
" unfinished = zero_mask\n",
" finished = flag\n",
" print(unfinished)\n",
" print(finished)\n",
" score.masked_fill_(unfinished, -float('inf'))\n",
" score.masked_fill_(finished, 0)\n",
" return score"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "agreed-portuguese",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[ True],\n",
" [False]])\n",
"tensor([[-0.8841, 0.7381, -0.9986],\n",
" [ 0.2675, -0.7971, 0.3798]])\n",
"tensor([[ True, True],\n",
" [False, False]])\n"
]
}
],
"source": [
"score = torch.randn((2, 3))\n",
"flag = torch.ones((2, 1), dtype=torch.bool)\n",
"flag[1] = False\n",
"print(flag)\n",
"print(score)\n",
"print(flag.repeat([1, 2]))"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "clean-aspect",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[False, True, True],\n",
" [False, False, False]])\n",
"tensor([[ True, False, False],\n",
" [False, False, False]])\n",
"tensor([[ 0.0000, -inf, -inf],\n",
" [ 0.2675, -0.7971, 0.3798]])\n",
"tensor([[ 0.0000, -inf, -inf],\n",
" [ 0.2675, -0.7971, 0.3798]])\n"
]
}
],
"source": [
"r = mask_finished_scores(score, flag)\n",
"print(r)\n",
"print(score)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "thrown-airline",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tensor(shape=[2, 1], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True ],\n",
" [False]])\n",
"Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, 1.87704289, 0.01988174],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"Tensor(shape=[2, 2], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , True ],\n",
" [False, False]])\n"
]
}
],
"source": [
"import paddle\n",
"\n",
"score = paddle.randn((2, 3))\n",
"flag = paddle.ones((2, 1), dtype='bool')\n",
"flag[1] = False\n",
"print(flag)\n",
"print(score)\n",
"print(flag.tile([1, 2]))"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "internal-patent",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tensor(shape=[2, 3], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[False, True , True ],\n",
" [False, False, False]])\n",
"Tensor(shape=[2, 3], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , False, False],\n",
" [False, False, False]])\n",
"x Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, 1.87704289, 0.01988174],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"2 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, 1.87704289, 0.01988174],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"3 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"x Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"2 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 2.05994511, -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"3 Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 0. , -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n",
"Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 0. , -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])\n"
]
}
],
"source": [
"paddle.bool = 'bool'\n",
"\n",
"def masked_fill(xs:paddle.Tensor, mask:paddle.Tensor, value:float):\n",
" print(xs)\n",
" trues = paddle.ones_like(xs) * value\n",
" assert xs.shape == mask.shape\n",
" xs = paddle.where(mask, trues, xs)\n",
" return xs\n",
"\n",
"def masked_fill_(xs:paddle.Tensor, mask:paddle.Tensor, value:float):\n",
" print('x', xs)\n",
" trues = paddle.ones_like(xs) * value\n",
" assert xs.shape == mask.shape\n",
" ret = paddle.where(mask, trues, xs)\n",
" print('2', xs)\n",
" paddle.assign(ret, output=xs)\n",
" print('3', xs)\n",
"\n",
"paddle.Tensor.masked_fill = masked_fill\n",
"paddle.Tensor.masked_fill_ = masked_fill_\n",
"\n",
"def mask_finished_scores_pd(score: paddle.Tensor,\n",
" flag: paddle.Tensor) -> paddle.Tensor:\n",
" \"\"\"\n",
" If a sequence is finished, we only allow one alive branch. This function\n",
" aims to give one branch a zero score and the rest -inf score.\n",
" Args:\n",
" score (torch.Tensor): A real value array with shape\n",
" (batch_size * beam_size, beam_size).\n",
" flag (torch.Tensor): A bool array with shape\n",
" (batch_size * beam_size, 1).\n",
" Returns:\n",
" torch.Tensor: (batch_size * beam_size, beam_size).\n",
" \"\"\"\n",
" beam_size = score.shape[-1]\n",
" zero_mask = paddle.zeros_like(flag, dtype=paddle.bool)\n",
" if beam_size > 1:\n",
" unfinished = paddle.concat((zero_mask, flag.tile([1, beam_size - 1])),\n",
" axis=1)\n",
" finished = paddle.concat((flag, zero_mask.tile([1, beam_size - 1])),\n",
" axis=1)\n",
" else:\n",
" unfinished = zero_mask\n",
" finished = flag\n",
" print(unfinished)\n",
" print(finished)\n",
" \n",
" #score.masked_fill_(unfinished, -float('inf'))\n",
" #score.masked_fill_(finished, 0)\n",
"# infs = paddle.ones_like(score) * -float('inf')\n",
"# score = paddle.where(unfinished, infs, score)\n",
"# score = paddle.where(finished, paddle.zeros_like(score), score)\n",
"\n",
"# score = score.masked_fill(unfinished, -float('inf'))\n",
"# score = score.masked_fill(finished, 0)\n",
" score.masked_fill_(unfinished, -float('inf'))\n",
" score.masked_fill_(finished, 0)\n",
" return score\n",
"\n",
"r = mask_finished_scores_pd(score, flag)\n",
"print(r)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "vocal-prime",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<bound method PyCapsule.value of Tensor(shape=[2, 3], dtype=float32, place=CUDAPlace(0), stop_gradient=True,\n",
" [[ 0. , -inf. , -inf. ],\n",
" [-0.40165186, 0.77547729, -0.64469045]])>"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score.value"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "bacterial-adolescent",
"metadata": {},
"outputs": [],
"source": [
"from typing import Union, Any"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "absent-fiber",
"metadata": {},
"outputs": [],
"source": [
"def repeat(xs : paddle.Tensor, *size: Any):\n",
" print(size)\n",
" return paddle.tile(xs, size)\n",
"paddle.Tensor.repeat = repeat"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "material-harbor",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 2)\n",
"Tensor(shape=[2, 2], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , True ],\n",
" [False, False]])\n"
]
}
],
"source": [
"flag = paddle.ones((2, 1), dtype='bool')\n",
"flag[1] = False\n",
"print(flag.repeat(1, 2))"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "acute-brighton",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Tensor(shape=[1], dtype=int64, place=CUDAPlace(0), stop_gradient=True,\n",
" [1]), 2)\n",
"Tensor(shape=[2, 2], dtype=bool, place=CUDAPlace(0), stop_gradient=True,\n",
" [[True , True ],\n",
" [False, False]])\n"
]
}
],
"source": [
"flag = paddle.ones((2, 1), dtype='bool')\n",
"flag[1] = False\n",
"print(flag.repeat(paddle.to_tensor(1), 2))"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "european-rugby",
"metadata": {},
"outputs": [],
"source": [
"def size(xs, *args: int):\n",
" nargs = len(args)\n",
" s = paddle.shape(xs)\n",
" assert(nargs <= 1)\n",
" if nargs == 1:\n",
" return s[args[0]]\n",
" else:\n",
" return s\n",
"paddle.Tensor.size = size"
]
},
{
"cell_type": "code",
"execution_count": 86,
"id": "moral-special",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[2], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [2, 1])"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flag.size()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "ahead-coach",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [1])"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flag.size(1)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "incomplete-fitness",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=True,\n",
" [2])"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"flag.size(0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "upset-connectivity",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

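The notebook above exercises the same finished-beam masking rule in torch and in paddle. A framework-free sketch of that rule (illustration only, not part of the diff): for a finished hypothesis, only the first expansion stays alive with score 0, and every other expansion gets -inf so it can never win the subsequent top-k:

import numpy as np

def mask_finished_scores_np(score, flag):
    # score: (batch*beam, beam); flag: (batch*beam, 1) boolean "finished" flag
    beam = score.shape[-1]
    zero = np.zeros_like(flag, dtype=bool)
    unfinished = np.concatenate([zero, np.tile(flag, (1, beam - 1))], axis=1)
    finished = np.concatenate([flag, np.tile(zero, (1, beam - 1))], axis=1)
    score = np.where(unfinished, -np.inf, score)  # kill extra branches
    score = np.where(finished, 0.0, score)        # keep one alive branch
    return score

score = np.random.randn(2, 3)
flag = np.array([[True], [False]])
print(mask_finished_scores_np(score, flag))
# row 0 becomes [0, -inf, -inf]; row 1 (unfinished) is left untouched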
@ -0,0 +1,231 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"id": "designing-borough",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspace/DeepSpeech-2.x/tools/venv/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
" and should_run_async(code)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00\n",
" 0.0000000e+00 0.0000000e+00]\n",
" [ 8.4147096e-01 8.0196178e-01 7.6172036e-01 ... 1.2409373e-04\n",
" 1.1547816e-04 1.0746076e-04]\n",
" [ 9.0929741e-01 9.5814437e-01 9.8704624e-01 ... 2.4818745e-04\n",
" 2.3095631e-04 2.1492151e-04]\n",
" ...\n",
" [ 3.7960774e-01 7.4510968e-01 7.3418564e-01 ... 1.2036801e-02\n",
" 1.1201146e-02 1.0423505e-02]\n",
" [-5.7338190e-01 -8.9752287e-02 -4.1488394e-02 ... 1.2160885e-02\n",
" 1.1316618e-02 1.0530960e-02]\n",
" [-9.9920684e-01 -8.5234123e-01 -7.8794664e-01 ... 1.2284970e-02\n",
" 1.1432089e-02 1.0638415e-02]]\n",
"True\n",
"True\n"
]
}
],
"source": [
"import torch\n",
"import math\n",
"import numpy as np\n",
"\n",
"max_len=100\n",
"d_model=256\n",
"\n",
"pe = torch.zeros(max_len, d_model)\n",
"position = torch.arange(0, max_len,\n",
" dtype=torch.float32).unsqueeze(1)\n",
"toruch_position = position\n",
"div_term = torch.exp(\n",
" torch.arange(0, d_model, 2, dtype=torch.float32) *\n",
" -(math.log(10000.0) / d_model))\n",
"tourch_div_term = div_term.cpu().detach().numpy()\n",
"\n",
"\n",
"\n",
"torhc_sin = torch.sin(position * div_term)\n",
"torhc_cos = torch.cos(position * div_term)\n",
"print(torhc_sin.cpu().detach().numpy())\n",
"np_sin = np.sin((position * div_term).cpu().detach().numpy())\n",
"np_cos = np.cos((position * div_term).cpu().detach().numpy())\n",
"print(np.allclose(np_sin, torhc_sin.cpu().detach().numpy()))\n",
"print(np.allclose(np_cos, torhc_cos.cpu().detach().numpy()))\n",
"pe[:, 0::2] = torhc_sin\n",
"pe[:, 1::2] = torhc_cos\n",
"tourch_pe = pe.cpu().detach().numpy()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "swiss-referral",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"False\n",
"False\n",
"False\n",
"False\n",
"[[ 1. 1. 1. ... 1. 1.\n",
" 1. ]\n",
" [ 0.5403023 0.59737533 0.6479059 ... 1. 1.\n",
" 1. ]\n",
" [-0.41614684 -0.28628543 -0.1604359 ... 0.99999994 1.\n",
" 1. ]\n",
" ...\n",
" [-0.92514753 -0.66694194 -0.67894876 ... 0.9999276 0.99993724\n",
" 0.9999457 ]\n",
" [-0.81928825 -0.9959641 -0.999139 ... 0.99992603 0.999936\n",
" 0.99994457]\n",
" [ 0.03982088 -0.52298605 -0.6157435 ... 0.99992454 0.9999347\n",
" 0.99994344]]\n",
"----\n",
"[[ 1. 1. 1. ... 1. 1.\n",
" 1. ]\n",
" [ 0.54030234 0.59737533 0.6479059 ... 1. 1.\n",
" 1. ]\n",
" [-0.41614684 -0.28628543 -0.1604359 ... 1. 1.\n",
" 1. ]\n",
" ...\n",
" [-0.92514753 -0.66694194 -0.67894876 ... 0.9999276 0.9999373\n",
" 0.9999457 ]\n",
" [-0.81928825 -0.9959641 -0.999139 ... 0.99992603 0.999936\n",
" 0.99994457]\n",
" [ 0.03982088 -0.5229861 -0.6157435 ... 0.99992454 0.9999347\n",
" 0.99994344]]\n",
")))))))\n",
"[[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00\n",
" 0.0000000e+00 0.0000000e+00]\n",
" [ 8.4147096e-01 8.0196178e-01 7.6172036e-01 ... 1.2409373e-04\n",
" 1.1547816e-04 1.0746076e-04]\n",
" [ 9.0929741e-01 9.5814437e-01 9.8704624e-01 ... 2.4818745e-04\n",
" 2.3095631e-04 2.1492151e-04]\n",
" ...\n",
" [ 3.7960774e-01 7.4510968e-01 7.3418564e-01 ... 1.2036801e-02\n",
" 1.1201146e-02 1.0423505e-02]\n",
" [-5.7338190e-01 -8.9752287e-02 -4.1488394e-02 ... 1.2160885e-02\n",
" 1.1316618e-02 1.0530960e-02]\n",
" [-9.9920684e-01 -8.5234123e-01 -7.8794664e-01 ... 1.2284970e-02\n",
" 1.1432089e-02 1.0638415e-02]]\n",
"----\n",
"[[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00\n",
" 0.0000000e+00 0.0000000e+00]\n",
" [ 8.4147096e-01 8.0196178e-01 7.6172036e-01 ... 1.2409373e-04\n",
" 1.1547816e-04 1.0746076e-04]\n",
" [ 9.0929741e-01 9.5814437e-01 9.8704624e-01 ... 2.4818745e-04\n",
" 2.3095631e-04 2.1492151e-04]\n",
" ...\n",
" [ 3.7960774e-01 7.4510968e-01 7.3418564e-01 ... 1.2036801e-02\n",
" 1.1201146e-02 1.0423505e-02]\n",
" [-5.7338190e-01 -8.9752287e-02 -4.1488394e-02 ... 1.2160885e-02\n",
" 1.1316618e-02 1.0530960e-02]\n",
" [-9.9920684e-01 -8.5234123e-01 -7.8794664e-01 ... 1.2284970e-02\n",
" 1.1432089e-02 1.0638415e-02]]\n"
]
}
],
"source": [
"import paddle\n",
"paddle.set_device('cpu')\n",
"ppe = paddle.zeros((max_len, d_model), dtype='float32')\n",
"position = paddle.arange(0, max_len,\n",
" dtype='float32').unsqueeze(1)\n",
"print(np.allclose(position.numpy(), toruch_position))\n",
"div_term = paddle.exp(\n",
" paddle.arange(0, d_model, 2, dtype='float32') *\n",
" -(math.log(10000.0) / d_model))\n",
"print(np.allclose(div_term.numpy(), tourch_div_term))\n",
"\n",
"\n",
"\n",
"p_sin = paddle.sin(position * div_term)\n",
"p_cos = paddle.cos(position * div_term)\n",
"print(np.allclose(np_sin, p_sin.numpy(), rtol=1.e-6, atol=0))\n",
"print(np.allclose(np_cos, p_cos.numpy(), rtol=1.e-6, atol=0))\n",
"ppe[:, 0::2] = p_sin\n",
"ppe[:, 1::2] = p_cos\n",
"print(np.allclose(p_sin.numpy(), torhc_sin.cpu().detach().numpy()))\n",
"print(np.allclose(p_cos.numpy(), torhc_cos.cpu().detach().numpy()))\n",
"print(p_cos.numpy())\n",
"print(\"----\")\n",
"print(torhc_cos.cpu().detach().numpy())\n",
"print(\")))))))\")\n",
"print(p_sin.numpy())\n",
"print(\"----\")\n",
"print(torhc_sin.cpu().detach().numpy())"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "integrated-boards",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n"
]
}
],
"source": [
"print(np.allclose(ppe.numpy(), pe.numpy()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "flying-reserve",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "revised-divide",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

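The notebook above builds the same sinusoidal positional-encoding table in torch and paddle and compares the two elementwise. For reference, the underlying formula as a plain numpy sketch (not part of the diff): even dimensions carry sin, odd dimensions carry cos, with the standard 10000**(2i/d_model) frequency schedule.

import math
import numpy as np

max_len, d_model = 100, 256
position = np.arange(max_len, dtype=np.float32)[:, None]            # (T, 1)
div_term = np.exp(np.arange(0, d_model, 2, dtype=np.float32)
                  * -(math.log(10000.0) / d_model))                  # (D/2,)

pe = np.zeros((max_len, d_model), dtype=np.float32)
pe[:, 0::2] = np.sin(position * div_term)   # even dims carry sin
pe[:, 1::2] = np.cos(position * div_term)   # odd dims carry cos
print(pe.shape)  # (100, 256)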
File diff suppressed because one or more lines are too long

@ -249,7 +249,7 @@
} }
], ],
"source": [ "source": [
" for idx, (audio, text, audio_len, text_len) in enumerate(batch_reader()):\n", " for idx, (audio, audio_len, text, text_len) in enumerate(batch_reader()):\n",
" print('test', text)\n", " print('test', text)\n",
" print(\"test raw\", ''.join(batch_reader.dataset.vocab_list[i] for i in text[0]))\n", " print(\"test raw\", ''.join(batch_reader.dataset.vocab_list[i] for i in text[0]))\n",
" print(\"test raw\", ''.join(batch_reader.dataset.vocab_list[i] for i in text[-1]))\n", " print(\"test raw\", ''.join(batch_reader.dataset.vocab_list[i] for i in text[-1]))\n",
@ -454,7 +454,7 @@
" act='brelu')\n", " act='brelu')\n",
"\n", "\n",
" out_channel = 32\n", " out_channel = 32\n",
" self.conv_stack = nn.LayerList([\n", " self.conv_stack = nn.Sequential([\n",
" ConvBn(\n", " ConvBn(\n",
" num_channels_in=32,\n", " num_channels_in=32,\n",
" num_channels_out=out_channel,\n", " num_channels_out=out_channel,\n",
@ -835,7 +835,7 @@
"\n", "\n",
" return logits, probs, audio_len\n", " return logits, probs, audio_len\n",
"\n", "\n",
" def forward(self, audio, text, audio_len, text_len):\n", " def forward(self, audio, audio_len, text, text_len):\n",
" \"\"\"\n", " \"\"\"\n",
" audio: shape [B, D, T]\n", " audio: shape [B, D, T]\n",
" text: shape [B, T]\n", " text: shape [B, T]\n",
@ -877,10 +877,10 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"audio, text, audio_len, text_len = None, None, None, None\n", "audio, audio_len, text, text_len = None, None, None, None\n",
"\n", "\n",
"for idx, inputs in enumerate(batch_reader):\n", "for idx, inputs in enumerate(batch_reader):\n",
" audio, text, audio_len, text_len = inputs\n", " audio, audio_len, text, text_len = inputs\n",
"# print(idx)\n", "# print(idx)\n",
"# print('a', audio.shape, audio.place)\n", "# print('a', audio.shape, audio.place)\n",
"# print('t', text)\n", "# print('t', text)\n",
@ -960,7 +960,7 @@
} }
], ],
"source": [ "source": [
"outputs = dp_model(audio, text, audio_len, text_len)\n", "outputs = dp_model(audio, audio_len, text, text_len)\n",
"logits, _, logits_len = outputs\n", "logits, _, logits_len = outputs\n",
"print('logits len', logits_len)\n", "print('logits len', logits_len)\n",
"loss = loss_fn.forward(logits, text, logits_len, text_len)\n", "loss = loss_fn.forward(logits, text, logits_len, text_len)\n",
@ -1884,4 +1884,4 @@
}, },
"nbformat": 4, "nbformat": 4,
"nbformat_minor": 5 "nbformat_minor": 5
} }

File diff suppressed because it is too large

@ -3,6 +3,7 @@
hooks: hooks:
- id: yapf - id: yapf
files: \.py$ files: \.py$
exclude: (?=third_party).*(\.py)$
- repo: https://github.com/pre-commit/pre-commit-hooks - repo: https://github.com/pre-commit/pre-commit-hooks
sha: a11d9314b22d8f8c7556443875b731ef05965464 sha: a11d9314b22d8f8c7556443875b731ef05965464
hooks: hooks:
@ -14,7 +15,22 @@
files: \.md$ files: \.md$
- id: trailing-whitespace - id: trailing-whitespace
files: \.md$ files: \.md$
- repo: https://github.com/Lucas-C/pre-commit-hooks - id: requirements-txt-fixer
exclude: (?=third_party).*$
- id: check-yaml
- id: check-json
- id: pretty-format-json
args:
- --no-sort-keys
- --autofix
- id: check-merge-conflict
- id: flake8
args:
- --ignore=E501,E228,E226,E261,E266,E128,E402,W503
- --builtins=G,request
- --jobs=1
exclude: (?=third_party).*(\.py)$
- repo: https://github.com/Lucas-C/pre-commit-hooks
sha: v1.0.1 sha: v1.0.1
hooks: hooks:
- id: forbid-crlf - id: forbid-crlf
@ -38,4 +54,9 @@
entry: python .pre-commit-hooks/copyright-check.hook entry: python .pre-commit-hooks/copyright-check.hook
language: system language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$ files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
#exclude: (?=decoders/swig).*(\.cpp|\.h)$ exclude: (?=third_party|pypinyin).*(\.cpp|\.h|\.py)$
- repo: https://github.com/asottile/reorder_python_imports
rev: v2.4.0
hooks:
- id: reorder-python-imports
exclude: (?=third_party).*(\.py)$

@ -19,14 +19,14 @@ addons:
before_install: before_install:
- python3 --version - python3 --version
- python3 -m pip --version - python3 -m pip --version
- sudo pip install -U virtualenv pre-commit pip - pip3 --version
- sudo pip3 install -U virtualenv pre-commit pip
- docker pull paddlepaddle/paddle:latest - docker pull paddlepaddle/paddle:latest
script: script:
- exit_code=0 - exit_code=0
- .travis/precommit.sh || exit_code=$(( exit_code | $? ))
- docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c - docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
'cd /py_unittest; source env.sh; bash .travis/unittest.sh' || exit_code=$(( exit_code | $? )) 'cd /py_unittest && bash .travis/precommit.sh && source env.sh && bash .travis/unittest.sh' || exit_code=$(( exit_code | $? ))
exit $exit_code exit $exit_code
notifications: notifications:

@ -0,0 +1,37 @@
#!/bin/bash
setup_env(){
cd tools && make && cd -
}
install(){
if [ -f "setup.sh" ]; then
bash setup.sh
#export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
fi
if [ $? != 0 ]; then
exit 1
fi
}
print_env(){
cat /etc/lsb-release
gcc -v
g++ -v
}
abort(){
echo "Run install failed" 1>&2
echo "Please check your code" 1>&2
exit 1
}
trap 'abort' 0
set -e
print_env
setup_env
source tools/venv/bin/activate
install
trap : 0

@ -1,16 +1,18 @@
#!/bin/bash #!/bin/bash
function abort(){ function abort(){
echo "Your commit not fit PaddlePaddle code style" 1>&2 echo "Your commit not fit PaddlePaddle code style" 1>&2
echo "Please use pre-commit scripts to auto-format your code" 1>&2 echo "Please use pre-commit scripts to auto-format your code" 1>&2
exit 1 exit 1
} }
trap 'abort' 0 trap 'abort' 0
set -e set -e
cd `dirname $0`
cd .. source tools/venv/bin/activate
export PATH=/usr/bin:$PATH
pre-commit install python3 --version
if ! pre-commit run -a ; then if ! pre-commit run -a ; then
ls -lh ls -lh

@ -1,11 +1,14 @@
#!/bin/bash #!/bin/bash
abort(){ abort(){
echo "Run unittest failed" 1>&2 echo "Run unittest failed" 1>&2
echo "Please check your code" 1>&2 echo "Please check your code" 1>&2
exit 1 exit 1
} }
unittest(){ unittest(){
cd $1 > /dev/null cd $1 > /dev/null
if [ -f "setup.sh" ]; then if [ -f "setup.sh" ]; then
@ -21,13 +24,31 @@ unittest(){
cd - > /dev/null cd - > /dev/null
} }
coverage(){
cd $1 > /dev/null
if [ -f "setup.sh" ]; then
bash setup.sh
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
fi
if [ $? != 0 ]; then
exit 1
fi
find . -path ./tools/venv -prune -false -o -name 'tests' -type d -print0 | \
xargs -0 -I{} -n1 bash -c \
'python3 -m coverage run --branch {}'
python3 -m coverage report -m
python3 -m coverage html
cd - > /dev/null
}
trap 'abort' 0 trap 'abort' 0
set -e set -e
cd tools; make; cd - source tools/venv/bin/activate
. tools/venv/bin/activate #pip3 install pytest
pip3 install pytest #unittest .
coverage .
unittest .
trap : 0 trap : 0

.vimrc (468 lines added)

@ -0,0 +1,468 @@
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Maintainer:
" Amir Salihefendic — @amix3k
"
" Awesome_version:
" Get this config, nice color schemes and lots of plugins!
"
" Install the awesome version from:
"
" https://github.com/amix/vimrc
"
" Sections:
" -> General
" -> VIM user interface
" -> Colors and Fonts
" -> Files and backups
" -> Text, tab and indent related
" -> Visual mode related
" -> Moving around, tabs and buffers
" -> Status line
" -> Editing mappings
" -> vimgrep searching and cope displaying
" -> Spell checking
" -> Misc
" -> Helper functions
"
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => General
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Sets how many lines of history VIM has to remember
set history=500
" Enable filetype plugins
filetype plugin on
filetype indent on
" Set to auto read when a file is changed from the outside
set autoread
au FocusGained,BufEnter * checktime
" With a map leader it's possible to do extra key combinations
" like <leader>w saves the current file
let mapleader = ","
" Fast saving
nmap <leader>w :w!<cr>
" :W sudo saves the file
" (useful for handling the permission-denied error)
command! W execute 'w !sudo tee % > /dev/null' <bar> edit!
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => VIM user interface
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Set 7 lines to the cursor - when moving vertically using j/k
set so=7
" Avoid garbled characters in Chinese language windows OS
let $LANG='en'
set langmenu=en
source $VIMRUNTIME/delmenu.vim
source $VIMRUNTIME/menu.vim
" Turn on the Wild menu
set wildmenu
" Ignore compiled files
set wildignore=*.o,*~,*.pyc
if has("win16") || has("win32")
set wildignore+=.git\*,.hg\*,.svn\*
else
set wildignore+=*/.git/*,*/.hg/*,*/.svn/*,*/.DS_Store
endif
"Always show current position
set ruler
" Height of the command bar
set cmdheight=1
" A buffer becomes hidden when it is abandoned
set hid
" Configure backspace so it acts as it should act
set backspace=eol,start,indent
set whichwrap+=<,>,h,l
" Ignore case when searching
set ignorecase
" When searching try to be smart about cases
set smartcase
" Highlight search results
set hlsearch
" Makes search act like search in modern browsers
set incsearch
" Don't redraw while executing macros (good performance config)
set lazyredraw
" For regular expressions turn magic on
set magic
" Show matching brackets when text indicator is over them
set showmatch
" How many tenths of a second to blink when matching brackets
set mat=2
" No annoying sound on errors
set noerrorbells
set novisualbell
set t_vb=
set tm=500
" Properly disable sound on errors on MacVim
if has("gui_macvim")
autocmd GUIEnter * set vb t_vb=
endif
" Add a bit extra margin to the left
set foldcolumn=1
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Colors and Fonts
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Enable syntax highlighting
syntax enable
" Enable 256 colors palette in Gnome Terminal
if $COLORTERM == 'gnome-terminal'
set t_Co=256
endif
try
colorscheme desert
catch
endtry
set background=dark
" Set extra options when running in GUI mode
if has("gui_running")
set guioptions-=T
set guioptions-=e
set t_Co=256
set guitablabel=%M\ %t
endif
" Set utf8 as standard encoding and en_US as the standard language
set encoding=utf8
set fileencodings=ucs-bom,utf-8,cp936
set fileencoding=gb2312
set termencoding=utf-8
" Use Unix as the standard file type
set ffs=unix,dos,mac
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Files, backups and undo
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Turn backup off, since most stuff is in SVN, git etc. anyway...
set nobackup
set nowb
set noswapfile
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Text, tab and indent related
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Use spaces instead of tabs
set expandtab
" Be smart when using tabs ;)
set smarttab
" 1 tab == 4 spaces
set shiftwidth=4
set tabstop=4
" Linebreak on 500 characters
set lbr
set tw=500
set ai "Auto indent
set si "Smart indent
set wrap "Wrap lines
""""""""""""""""""""""""""""""
" => Visual mode related
""""""""""""""""""""""""""""""
" Visual mode pressing * or # searches for the current selection
" Super useful! From an idea by Michael Naumann
vnoremap <silent> * :<C-u>call VisualSelection('', '')<CR>/<C-R>=@/<CR><CR>
vnoremap <silent> # :<C-u>call VisualSelection('', '')<CR>?<C-R>=@/<CR><CR>
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Moving around, tabs, windows and buffers
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Map <Space> to / (search) and Ctrl-<Space> to ? (backwards search)
map <space> /
map <C-space> ?
" Disable highlight when <leader><cr> is pressed
map <silent> <leader><cr> :noh<cr>
" Smart way to move between windows
map <C-j> <C-W>j
map <C-k> <C-W>k
map <C-h> <C-W>h
map <C-l> <C-W>l
" Close the current buffer
map <leader>bd :Bclose<cr>:tabclose<cr>gT
" Close all the buffers
map <leader>ba :bufdo bd<cr>
map <leader>l :bnext<cr>
map <leader>h :bprevious<cr>
" Useful mappings for managing tabs
map <leader>tn :tabnew<cr>
map <leader>to :tabonly<cr>
map <leader>tc :tabclose<cr>
map <leader>tm :tabmove
map <leader>t<leader> :tabnext
" Let 'tl' toggle between this and the last accessed tab
let g:lasttab = 1
nmap <Leader>tl :exe "tabn ".g:lasttab<CR>
au TabLeave * let g:lasttab = tabpagenr()
" Opens a new tab with the current buffer's path
" Super useful when editing files in the same directory
map <leader>te :tabedit <C-r>=expand("%:p:h")<cr>/
" Switch CWD to the directory of the open buffer
map <leader>cd :cd %:p:h<cr>:pwd<cr>
" Specify the behavior when switching between buffers
try
set switchbuf=useopen,usetab,newtab
set stal=2
catch
endtry
" Return to last edit position when opening files (You want this!)
au BufReadPost * if line("'\"") > 1 && line("'\"") <= line("$") | exe "normal! g'\"" | endif
""""""""""""""""""""""""""""""
" => Status line
""""""""""""""""""""""""""""""
" Always show the status line
set laststatus=2
" Format the status line
set statusline=\ %{HasPaste()}%F%m%r%h\ %w\ \ CWD:\ %r%{getcwd()}%h\ \ \ Line:\ %l\ \ Column:\ %c
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Editing mappings
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Remap VIM 0 to first non-blank character
map 0 ^
" Move a line of text using ALT+[jk] or Command+[jk] on mac
nmap <M-j> mz:m+<cr>`z
nmap <M-k> mz:m-2<cr>`z
vmap <M-j> :m'>+<cr>`<my`>mzgv`yo`z
vmap <M-k> :m'<-2<cr>`>my`<mzgv`yo`z
if has("mac") || has("macunix")
nmap <D-j> <M-j>
nmap <D-k> <M-k>
vmap <D-j> <M-j>
vmap <D-k> <M-k>
endif
" Delete trailing white space on save, useful for some filetypes ;)
fun! CleanExtraSpaces()
let save_cursor = getpos(".")
let old_query = getreg('/')
silent! %s/\s\+$//e
call setpos('.', save_cursor)
call setreg('/', old_query)
endfun
if has("autocmd")
autocmd BufWritePre *.txt,*.js,*.py,*.wiki,*.sh,*.coffee :call CleanExtraSpaces()
endif
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Spell checking
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Pressing ,ss will toggle and untoggle spell checking
map <leader>ss :setlocal spell!<cr>
" Shortcuts using <leader>
map <leader>sn ]s
map <leader>sp [s
map <leader>sa zg
map <leader>s? z=
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Misc
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Remove the Windows ^M - when the encodings gets messed up
noremap <Leader>m mmHmt:%s/<C-V><cr>//ge<cr>'tzt'm
" Quickly open a buffer for scribble
map <leader>q :e ~/buffer<cr>
" Quickly open a markdown buffer for scribble
map <leader>x :e ~/buffer.md<cr>
" Toggle paste mode on and off
map <leader>pp :setlocal paste!<cr>
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" => Helper functions
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Returns true if paste mode is enabled
function! HasPaste()
if &paste
return 'PASTE MODE '
endif
return ''
endfunction
" Don't close window, when deleting a buffer
command! Bclose call <SID>BufcloseCloseIt()
function! <SID>BufcloseCloseIt()
let l:currentBufNum = bufnr("%")
let l:alternateBufNum = bufnr("#")
if buflisted(l:alternateBufNum)
buffer #
else
bnext
endif
if bufnr("%") == l:currentBufNum
new
endif
if buflisted(l:currentBufNum)
execute("bdelete! ".l:currentBufNum)
endif
endfunction
function! CmdLine(str)
call feedkeys(":" . a:str)
endfunction
function! VisualSelection(direction, extra_filter) range
let l:saved_reg = @"
execute "normal! vgvy"
let l:pattern = escape(@", "\\/.*'$^~[]")
let l:pattern = substitute(l:pattern, "\n$", "", "")
if a:direction == 'gv'
call CmdLine("Ack '" . l:pattern . "' " )
elseif a:direction == 'replace'
call CmdLine("%s" . '/'. l:pattern . '/')
endif
let @/ = l:pattern
let @" = l:saved_reg
endfunction
""""""""""""""""""""""""""""""
" => Python section
""""""""""""""""""""""""""""""
let python_highlight_all = 1
au FileType python syn keyword pythonDecorator True None False self
au BufNewFile,BufRead *.jinja set syntax=htmljinja
au BufNewFile,BufRead *.mako set ft=mako
au FileType python map <buffer> F :set foldmethod=indent<cr>
au FileType python inoremap <buffer> $r return
au FileType python inoremap <buffer> $i import
au FileType python inoremap <buffer> $p print
au FileType python inoremap <buffer> $f # --- <esc>a
au FileType python map <buffer> <leader>1 /class
au FileType python map <buffer> <leader>2 /def
au FileType python map <buffer> <leader>C ?class
au FileType python map <buffer> <leader>D ?def
""""""""""""""""""""""""""""""
" => JavaScript section
"""""""""""""""""""""""""""""""
au FileType javascript call JavaScriptFold()
au FileType javascript setl fen
au FileType javascript setl nocindent
au FileType javascript imap <C-t> $log();<esc>hi
au FileType javascript imap <C-a> alert();<esc>hi
au FileType javascript inoremap <buffer> $r return
au FileType javascript inoremap <buffer> $f // --- PH<esc>FP2xi
function! JavaScriptFold()
setl foldmethod=syntax
setl foldlevelstart=1
syn region foldBraces start=/{/ end=/}/ transparent fold keepend extend
function! FoldText()
return substitute(getline(v:foldstart), '{.*', '{...}', '')
endfunction
setl foldtext=FoldText()
endfunction
""""""""""""""""""""""""""""""
" => CoffeeScript section
"""""""""""""""""""""""""""""""
function! CoffeeScriptFold()
setl foldmethod=indent
setl foldlevelstart=1
endfunction
au FileType coffee call CoffeeScriptFold()
au FileType gitcommit call setpos('.', [0, 1, 1, 0])
""""""""""""""""""""""""""""""
" => Shell section
""""""""""""""""""""""""""""""
if exists('$TMUX')
if has('nvim')
set termguicolors
else
set term=screen-256color
endif
endif
""""""""""""""""""""""""""""""
" => Twig section
""""""""""""""""""""""""""""""
autocmd BufRead *.twig set syntax=html filetype=html
""""""""""""""""""""""""""""""
" => Markdown
""""""""""""""""""""""""""""""
let vim_markdown_folding_disabled = 1

@ -11,7 +11,10 @@
## Models ## Models
* [Baidu's Deep Speech2](http://proceedings.mlr.press/v48/amodei16.pdf) * [Baidu's DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Transformer](https://arxiv.org/abs/1706.03762)
* [Conformer](https://arxiv.org/abs/2005.08100)
* [U2](https://arxiv.org/pdf/2012.05481.pdf)
## Setup ## Setup
@ -22,19 +25,20 @@ Please see [install](docs/install.md).
## Getting Started ## Getting Started
Please see [Getting Started](docs/getting_started.md) and [tiny egs](examples/tiny/README.md). Please see [Getting Started](docs/src/geting_started.md) and [tiny egs](examples/tiny/README.md).
## More Information ## More Information
* [Install](docs/install.md) * [Install](docs/src/install.md)
* [Getting Started](docs/getting_started.md) * [Getting Started](docs/src/geting_stared.md)
* [Data Prepration](docs/data_preparation.md) * [Data Prepration](docs/src/data_preparation.md)
* [Data Augmentation](docs/augmentation.md) * [Data Augmentation](docs/src/augmentation.md)
* [Ngram LM](docs/ngram_lm.md) * [Ngram LM](docs/src/ngram_lm.md)
* [Server Demo](docs/server.md) * [Server Demo](docs/src/server.md)
* [Benchmark](docs/benchmark.md) * [Benchmark](docs/src/benchmark.md)
* [Relased Model](docs/released_model.md) * [Relased Model](docs/src/released_model.md)
* [FAQ](docs/faq.md) * [FAQ](docs/src/faq.md)
## Questions and Help ## Questions and Help
@ -45,3 +49,7 @@ You are welcome to submit questions in [Github Discussions](https://github.com/P
## License ## License
DeepSpeech is provided under the [Apache-2.0 License](./LICENSE). DeepSpeech is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
We depend on many open source repos. See [References](docs/src/reference.md) for more information.

@ -11,7 +11,11 @@
## 模型 ## 模型
* [Baidu's Deep Speech2](http://proceedings.mlr.press/v48/amodei16.pdf) * [Baidu's DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf)
* [Transformer](https://arxiv.org/abs/1706.03762)
* [Conformer](https://arxiv.org/abs/2005.08100)
* [U2](https://arxiv.org/pdf/2012.05481.pdf)
## 安装 ## 安装
@ -22,19 +26,19 @@
## 开始 ## 开始
请查看 [Getting Started](docs/getting_started.md) 和 [tiny egs](examples/tiny/README.md)。 请查看 [Getting Started](docs/src/geting_started.md) 和 [tiny egs](examples/tiny/README.md)。
## 更多信息 ## 更多信息
* [安装](docs/install.md) * [安装](docs/src/install.md)
* [开始](docs/getting_started.md) * [开始](docs/src/geting_stared.md)
* [数据处理](docs/data_preparation.md) * [数据处理](docs/src/data_preparation.md)
* [数据增强](docs/augmentation.md) * [数据增强](docs/src/augmentation.md)
* [语言模型](docs/ngram_lm.md) * [语言模型](docs/src/ngram_lm.md)
* [服务部署](docs/server.md) * [服务部署](docs/src/server.md)
* [Benchmark](docs/benchmark.md) * [Benchmark](docs/src/benchmark.md)
* [Relased Model](docs/released_model.md) * [Relased Model](docs/src/released_model.md)
* [FAQ](docs/faq.md) * [FAQ](docs/src/faq.md)
## 问题和帮助 ## 问题和帮助
@ -43,3 +47,7 @@
## License ## License
DeepSpeech遵循[Apache-2.0开源协议](./LICENSE)。 DeepSpeech遵循[Apache-2.0开源协议](./LICENSE)。
## 感谢
开发中参考一些优秀的仓库,详情参见 [References](docs/src/reference.md)。

@ -11,3 +11,478 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from typing import Any
from typing import List
from typing import Tuple
from typing import Union
import paddle
from paddle import nn
from paddle.fluid import core
from paddle.nn import functional as F
from deepspeech.utils.log import Log
#TODO(Hui Zhang): remove fluid import
logger = Log(__name__).getlog()
########### hack logging #############
logger.warn = logger.warning
########### hack paddle #############
paddle.bool = 'bool'
paddle.float16 = 'float16'
paddle.half = 'float16'
paddle.float32 = 'float32'
paddle.float = 'float32'
paddle.float64 = 'float64'
paddle.double = 'float64'
paddle.int8 = 'int8'
paddle.int16 = 'int16'
paddle.short = 'int16'
paddle.int32 = 'int32'
paddle.int = 'int32'
paddle.int64 = 'int64'
paddle.long = 'int64'
paddle.uint8 = 'uint8'
paddle.uint16 = 'uint16'
paddle.complex64 = 'complex64'
paddle.complex128 = 'complex128'
paddle.cdouble = 'complex128'
def convert_dtype_to_string(tensor_dtype):
"""
Convert a Paddle tensor dtype (core.VarDesc.VarType) to the dtype string
registered on the paddle module above.
Args:
tensor_dtype(core.VarDesc.VarType): the Paddle tensor data type.
Returns:
str: the corresponding dtype name, e.g. 'float32'.
"""
dtype = tensor_dtype
if dtype == core.VarDesc.VarType.FP32:
return paddle.float32
elif dtype == core.VarDesc.VarType.FP64:
return paddle.float64
elif dtype == core.VarDesc.VarType.FP16:
return paddle.float16
elif dtype == core.VarDesc.VarType.INT32:
return paddle.int32
elif dtype == core.VarDesc.VarType.INT16:
return paddle.int16
elif dtype == core.VarDesc.VarType.INT64:
return paddle.int64
elif dtype == core.VarDesc.VarType.BOOL:
return paddle.bool
elif dtype == core.VarDesc.VarType.BF16:
# since there is still no support for bfloat16 in NumPy,
# uint16 is used for casting bfloat16
return paddle.uint16
elif dtype == core.VarDesc.VarType.UINT8:
return paddle.uint8
elif dtype == core.VarDesc.VarType.INT8:
return paddle.int8
elif dtype == core.VarDesc.VarType.COMPLEX64:
return paddle.complex64
elif dtype == core.VarDesc.VarType.COMPLEX128:
return paddle.complex128
else:
raise ValueError("Not supported tensor dtype %s" % dtype)
if not hasattr(paddle, 'softmax'):
logger.warn("register user softmax to paddle, remove this when fixed!")
setattr(paddle, 'softmax', paddle.nn.functional.softmax)
if not hasattr(paddle, 'log_softmax'):
logger.warn("register user log_softmax to paddle, remove this when fixed!")
setattr(paddle, 'log_softmax', paddle.nn.functional.log_softmax)
if not hasattr(paddle, 'sigmoid'):
logger.warn("register user sigmoid to paddle, remove this when fixed!")
setattr(paddle, 'sigmoid', paddle.nn.functional.sigmoid)
if not hasattr(paddle, 'log_sigmoid'):
logger.warn("register user log_sigmoid to paddle, remove this when fixed!")
setattr(paddle, 'log_sigmoid', paddle.nn.functional.log_sigmoid)
if not hasattr(paddle, 'relu'):
logger.warn("register user relu to paddle, remove this when fixed!")
setattr(paddle, 'relu', paddle.nn.functional.relu)
def cat(xs, dim=0):
return paddle.concat(xs, axis=dim)
if not hasattr(paddle, 'cat'):
logger.warn(
"override cat of paddle if exists or register, remove this when fixed!")
paddle.cat = cat
########### hack paddle.Tensor #############
def item(x: paddle.Tensor):
return x.numpy().item()
if not hasattr(paddle.Tensor, 'item'):
logger.warn(
"override item of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.item = item
def func_long(x: paddle.Tensor):
return paddle.cast(x, paddle.long)
if not hasattr(paddle.Tensor, 'long'):
logger.warn(
"override long of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.long = func_long
if not hasattr(paddle.Tensor, 'numel'):
logger.warn(
"override numel of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.numel = paddle.numel
def new_full(x: paddle.Tensor,
size: Union[List[int], Tuple[int], paddle.Tensor],
fill_value: Union[float, int, bool, paddle.Tensor],
dtype=None):
return paddle.full(size, fill_value, dtype=x.dtype)
if not hasattr(paddle.Tensor, 'new_full'):
logger.warn(
"override new_full of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.new_full = new_full
def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
if convert_dtype_to_string(xs.dtype) == paddle.bool:
xs = xs.astype(paddle.int)
return xs.equal(
paddle.to_tensor(
ys, dtype=convert_dtype_to_string(xs.dtype), place=xs.place))
if not hasattr(paddle.Tensor, 'eq'):
logger.warn(
"override eq of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.eq = eq
if not hasattr(paddle, 'eq'):
logger.warn(
"override eq of paddle if exists or register, remove this when fixed!")
paddle.eq = eq
def contiguous(xs: paddle.Tensor) -> paddle.Tensor:
return xs
if not hasattr(paddle.Tensor, 'contiguous'):
logger.warn(
"override contiguous of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.contiguous = contiguous
def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
nargs = len(args)
assert (nargs <= 1)
s = paddle.shape(xs)
if nargs == 1:
return s[args[0]]
else:
return s
#`to_static` does not process the `size` property, and some `paddle` APIs may depend on it.
logger.warn(
"override size of paddle.Tensor "
"(`to_static` does not process the `size` property, and some `paddle` APIs may depend on it), remove this when fixed!"
)
)
paddle.Tensor.size = size
def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
return xs.reshape(args)
if not hasattr(paddle.Tensor, 'view'):
logger.warn("register user view to paddle.Tensor, remove this when fixed!")
paddle.Tensor.view = view
def view_as(xs: paddle.Tensor, ys: paddle.Tensor) -> paddle.Tensor:
return xs.reshape(ys.size())
if not hasattr(paddle.Tensor, 'view_as'):
logger.warn(
"register user view_as to paddle.Tensor, remove this when fixed!")
paddle.Tensor.view_as = view_as
def is_broadcastable(shp1, shp2):
for a, b in zip(shp1[::-1], shp2[::-1]):
if a == 1 or b == 1 or a == b:
pass
else:
return False
return True
def masked_fill(xs: paddle.Tensor,
mask: paddle.Tensor,
value: Union[float, int]):
assert is_broadcastable(xs.shape, mask.shape) is True
bshape = paddle.broadcast_shape(xs.shape, mask.shape)
mask = mask.broadcast_to(bshape)
trues = paddle.ones_like(xs) * value
xs = paddle.where(mask, trues, xs)
return xs
if not hasattr(paddle.Tensor, 'masked_fill'):
logger.warn(
"register user masked_fill to paddle.Tensor, remove this when fixed!")
paddle.Tensor.masked_fill = masked_fill
def masked_fill_(xs: paddle.Tensor,
mask: paddle.Tensor,
value: Union[float, int]) -> paddle.Tensor:
assert is_broadcastable(xs.shape, mask.shape) is True
bshape = paddle.broadcast_shape(xs.shape, mask.shape)
mask = mask.broadcast_to(bshape)
trues = paddle.ones_like(xs) * value
ret = paddle.where(mask, trues, xs)
paddle.assign(ret.detach(), output=xs)
return xs
if not hasattr(paddle.Tensor, 'masked_fill_'):
logger.warn(
"register user masked_fill_ to paddle.Tensor, remove this when fixed!")
paddle.Tensor.masked_fill_ = masked_fill_
def fill_(xs: paddle.Tensor, value: Union[float, int]) -> paddle.Tensor:
val = paddle.full_like(xs, value)
paddle.assign(val.detach(), output=xs)
return xs
if not hasattr(paddle.Tensor, 'fill_'):
logger.warn("register user fill_ to paddle.Tensor, remove this when fixed!")
paddle.Tensor.fill_ = fill_
def repeat(xs: paddle.Tensor, *size: Any) -> paddle.Tensor:
return paddle.tile(xs, size)
if not hasattr(paddle.Tensor, 'repeat'):
logger.warn(
"register user repeat to paddle.Tensor, remove this when fixed!")
paddle.Tensor.repeat = repeat
if not hasattr(paddle.Tensor, 'softmax'):
logger.warn(
"register user softmax to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'softmax', paddle.nn.functional.softmax)
if not hasattr(paddle.Tensor, 'sigmoid'):
logger.warn(
"register user sigmoid to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'sigmoid', paddle.nn.functional.sigmoid)
if not hasattr(paddle.Tensor, 'relu'):
logger.warn("register user relu to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'relu', paddle.nn.functional.relu)
def type_as(x: paddle.Tensor, other: paddle.Tensor) -> paddle.Tensor:
return x.astype(other.dtype)
if not hasattr(paddle.Tensor, 'type_as'):
logger.warn(
"register user type_as to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'type_as', type_as)
def to(x: paddle.Tensor, *args, **kwargs) -> paddle.Tensor:
assert len(args) == 1
if isinstance(args[0], str): # dtype
return x.astype(args[0])
elif isinstance(args[0], paddle.Tensor): #Tensor
return x.astype(args[0].dtype)
else: # Device
return x
if not hasattr(paddle.Tensor, 'to'):
logger.warn("register user to to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'to', to)
def func_float(x: paddle.Tensor) -> paddle.Tensor:
return x.astype(paddle.float)
if not hasattr(paddle.Tensor, 'float'):
logger.warn("register user float to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'float', func_float)
def tolist(x: paddle.Tensor) -> List[Any]:
return x.numpy().tolist()
if not hasattr(paddle.Tensor, 'tolist'):
logger.warn(
"register user tolist to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'tolist', tolist)
########### hack paddle.nn.functional #############
def glu(x: paddle.Tensor, axis=-1) -> paddle.Tensor:
"""The gated linear unit (GLU) activation."""
a, b = x.split(2, axis=axis)
act_b = F.sigmoid(b)
return a * act_b
if not hasattr(paddle.nn.functional, 'glu'):
logger.warn(
"register user glu to paddle.nn.functional, remove this when fixed!")
setattr(paddle.nn.functional, 'glu', glu)
# def softplus(x):
# """Softplus function."""
# if hasattr(paddle.nn.functional, 'softplus'):
# #return paddle.nn.functional.softplus(x.float()).type_as(x)
# return paddle.nn.functional.softplus(x)
# else:
# raise NotImplementedError
# def gelu_accurate(x):
# """Gaussian Error Linear Units (GELU) activation."""
# # [reference] https://github.com/pytorch/fairseq/blob/e75cff5f2c1d62f12dc911e0bf420025eb1a4e33/fairseq/modules/gelu.py
# if not hasattr(gelu_accurate, "_a"):
# gelu_accurate._a = math.sqrt(2 / math.pi)
# return 0.5 * x * (1 + paddle.tanh(gelu_accurate._a *
# (x + 0.044715 * paddle.pow(x, 3))))
# def gelu(x):
# """Gaussian Error Linear Units (GELU) activation."""
# if hasattr(nn.functional, 'gelu'):
# #return nn.functional.gelu(x.float()).type_as(x)
# return nn.functional.gelu(x)
# else:
# return x * 0.5 * (1.0 + paddle.erf(x / math.sqrt(2.0)))
# hack loss
def ctc_loss(logits,
labels,
input_lengths,
label_lengths,
blank=0,
reduction='mean',
norm_by_times=True):
#logger.info("my ctc loss with norm by times")
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times,
input_lengths, label_lengths)
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
assert reduction in ['mean', 'sum', 'none']
if reduction == 'mean':
loss_out = paddle.mean(loss_out / label_lengths)
elif reduction == 'sum':
loss_out = paddle.sum(loss_out)
return loss_out
logger.warn(
"override ctc_loss of paddle.nn.functional if exists, remove this when fixed!"
)
F.ctc_loss = ctc_loss
########### hack paddle.nn #############
if not hasattr(paddle.nn, 'Module'):
logger.warn("register user Module to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'Module', paddle.nn.Layer)
# maybe cause assert isinstance(sublayer, core.Layer)
if not hasattr(paddle.nn, 'ModuleList'):
logger.warn(
"register user ModuleList to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'ModuleList', paddle.nn.LayerList)
class GLU(nn.Layer):
"""Gated Linear Units (GLU) Layer"""
def __init__(self, dim: int=-1):
super().__init__()
self.dim = dim
def forward(self, xs):
return glu(xs, axis=self.dim)
if not hasattr(paddle.nn, 'GLU'):
logger.warn("register user GLU to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'GLU', GLU)
# TODO(Hui Zhang): remove this Layer
class ConstantPad2d(nn.Layer):
"""Pads the input tensor boundaries with a constant value.
For N-dimensional padding, use paddle.nn.functional.pad().
"""
def __init__(self, padding: Union[tuple, list, int], value: float):
"""
Args:
padding (Union[tuple, list, int]): the size of the padding.
If an int, uses the same padding on all boundaries.
If a 4-tuple/list, uses (padding_left, padding_right, padding_top, padding_bottom).
value (float): pad value
"""
super().__init__()
self.padding = padding if isinstance(padding,
(tuple, list)) else [padding] * 4
self.value = value
def forward(self, xs: paddle.Tensor) -> paddle.Tensor:
return nn.functional.pad(
xs,
self.padding,
mode='constant',
value=self.value,
data_format='NCHW')
if not hasattr(paddle.nn, 'ConstantPad2d'):
logger.warn(
"register user ConstantPad2d to paddle.nn, remove this when fixed!")
setattr(paddle.nn, 'ConstantPad2d', ConstantPad2d)
########### hack paddle.jit #############
if not hasattr(paddle.jit, 'export'):
logger.warn("register user export to paddle.jit, remove this when fixed!")
setattr(paddle.jit, 'export', paddle.jit.to_static)

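The module above back-fills torch-style helpers onto paddle objects by monkey-patching. A minimal self-contained sketch of that pattern, restricted to the masked_fill helper (names mirror the registrations above; illustration only, assuming paddle 2.1+):

import paddle

def masked_fill(xs, mask, value):
    # broadcast the mask to the tensor shape, then select fill value where True
    bshape = paddle.broadcast_shape(xs.shape, mask.shape)
    mask = mask.broadcast_to(bshape)
    fill = paddle.full_like(xs, value)
    return paddle.where(mask, fill, xs)

if not hasattr(paddle.Tensor, 'masked_fill'):
    paddle.Tensor.masked_fill = masked_fill

x = paddle.zeros([2, 3])
m = paddle.to_tensor([[True, False, True]])   # 1x3 mask, broadcast over rows
print(x.masked_fill(m, -1.0))
# [[-1., 0., -1.], [-1., 0., -1.]]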
@ -12,11 +12,11 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains various CTC decoders.""" """Contains various CTC decoders."""
import multiprocessing
from itertools import groupby from itertools import groupby
import numpy as np
from math import log from math import log
import multiprocessing
import numpy as np
def ctc_greedy_decoder(probs_seq, vocabulary): def ctc_greedy_decoder(probs_seq, vocabulary):
@ -104,14 +104,14 @@ def ctc_beam_search_decoder(probs_seq,
global ext_nproc_scorer global ext_nproc_scorer
ext_scoring_func = ext_nproc_scorer ext_scoring_func = ext_nproc_scorer
## initialize # initialize
# prefix_set_prev: the set containing selected prefixes # prefix_set_prev: the set containing selected prefixes
# probs_b_prev: prefixes' probability ending with blank in previous step # probs_b_prev: prefixes' probability ending with blank in previous step
# probs_nb_prev: prefixes' probability ending with non-blank in previous step # probs_nb_prev: prefixes' probability ending with non-blank in previous step
prefix_set_prev = {'\t': 1.0} prefix_set_prev = {'\t': 1.0}
probs_b_prev, probs_nb_prev = {'\t': 1.0}, {'\t': 0.0} probs_b_prev, probs_nb_prev = {'\t': 1.0}, {'\t': 0.0}
## extend prefix in loop # extend prefix in loop
for time_step in range(len(probs_seq)): for time_step in range(len(probs_seq)):
# prefix_set_next: the set containing candidate prefixes # prefix_set_next: the set containing candidate prefixes
# probs_b_cur: prefixes' probability ending with blank in current step # probs_b_cur: prefixes' probability ending with blank in current step
@ -120,7 +120,7 @@ def ctc_beam_search_decoder(probs_seq,
prob_idx = list(enumerate(probs_seq[time_step])) prob_idx = list(enumerate(probs_seq[time_step]))
cutoff_len = len(prob_idx) cutoff_len = len(prob_idx)
#If pruning is enabled # If pruning is enabled
if cutoff_prob < 1.0 or cutoff_top_n < cutoff_len: if cutoff_prob < 1.0 or cutoff_top_n < cutoff_len:
prob_idx = sorted(prob_idx, key=lambda asd: asd[1], reverse=True) prob_idx = sorted(prob_idx, key=lambda asd: asd[1], reverse=True)
cutoff_len, cum_prob = 0, 0.0 cutoff_len, cum_prob = 0, 0.0
@ -172,7 +172,7 @@ def ctc_beam_search_decoder(probs_seq,
# update probs # update probs
probs_b_prev, probs_nb_prev = probs_b_cur, probs_nb_cur probs_b_prev, probs_nb_prev = probs_b_cur, probs_nb_cur
## store top beam_size prefixes # store top beam_size prefixes
prefix_set_prev = sorted( prefix_set_prev = sorted(
prefix_set_next.items(), key=lambda asd: asd[1], reverse=True) prefix_set_next.items(), key=lambda asd: asd[1], reverse=True)
if beam_size < len(prefix_set_prev): if beam_size < len(prefix_set_prev):
@ -191,7 +191,7 @@ def ctc_beam_search_decoder(probs_seq,
else: else:
beam_result.append((float('-inf'), '')) beam_result.append((float('-inf'), ''))
## output top beam_size decoding results # output top beam_size decoding results
beam_result = sorted(beam_result, key=lambda asd: asd[0], reverse=True) beam_result = sorted(beam_result, key=lambda asd: asd[0], reverse=True)
return beam_result return beam_result

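The decoder file above keeps both a greedy decoder and a prefix beam-search decoder for CTC. As a quick reference for what the greedy path computes (best path: per-frame argmax, merge repeats, drop the blank), a small sketch; placing the blank at index 0 matches the blank_id=0 convention this PR adopts elsewhere, but it is an assumption of this illustration rather than a copy of ctc_greedy_decoder:

import numpy as np
from itertools import groupby

def greedy_decode(probs_seq, vocabulary, blank_id=0):
    best_path = np.argmax(probs_seq, axis=1)        # one symbol per frame
    collapsed = [k for k, _ in groupby(best_path)]  # merge consecutive repeats
    return ''.join(vocabulary[i] for i in collapsed if i != blank_id)

# toy example with vocabulary ['<blank>', 'a', 'b'] and blank at index 0
probs = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
print(greedy_decode(probs, ['<blank>', 'a', 'b']))  # -> "ab"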
@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""External Scorer for Beam Search Decoder.""" """External Scorer for Beam Search Decoder."""
import os import os
import kenlm import kenlm
import numpy as np import numpy as np
@ -71,7 +71,7 @@ class Scorer(object):
""" """
lm = self._language_model_score(sentence) lm = self._language_model_score(sentence)
word_cnt = self._word_count(sentence) word_cnt = self._word_count(sentence)
if log == False: if log is False:
score = np.power(lm, self._alpha) * np.power(word_cnt, self._beta) score = np.power(lm, self._alpha) * np.power(word_cnt, self._beta)
else: else:
score = self._alpha * np.log(lm) + self._beta * np.log(word_cnt) score = self._alpha * np.log(lm) + self._beta * np.log(word_cnt)

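The scorer above combines an n-gram LM probability with a word-count bonus: in linear space the score is lm**alpha * word_count**beta, which in log space becomes alpha*log(lm) + beta*log(word_count). A tiny numeric sketch of that equivalence (the alpha, beta, and probability values below are made up):

import numpy as np

alpha, beta = 2.5, 0.3
lm_prob, word_cnt = 1e-4, 5

linear = np.power(lm_prob, alpha) * np.power(word_cnt, beta)
logspace = alpha * np.log(lm_prob) + beta * np.log(word_cnt)
print(np.isclose(np.log(linear), logspace))  # True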
@ -36,167 +36,177 @@ std::vector<std::pair<double, std::string>> ctc_beam_search_decoder(
double cutoff_prob, double cutoff_prob,
size_t cutoff_top_n, size_t cutoff_top_n,
Scorer *ext_scorer) { Scorer *ext_scorer) {
// dimension check // dimension check
size_t num_time_steps = probs_seq.size(); size_t num_time_steps = probs_seq.size();
for (size_t i = 0; i < num_time_steps; ++i) { for (size_t i = 0; i < num_time_steps; ++i) {
VALID_CHECK_EQ(probs_seq[i].size(), VALID_CHECK_EQ(probs_seq[i].size(),
vocabulary.size() + 1, // vocabulary.size() + 1,
"The shape of probs_seq does not match with " vocabulary.size(),
"the shape of the vocabulary"); "The shape of probs_seq does not match with "
} "the shape of the vocabulary");
// assign blank id
size_t blank_id = vocabulary.size();
// assign space id
auto it = std::find(vocabulary.begin(), vocabulary.end(), " ");
int space_id = it - vocabulary.begin();
// if no space in vocabulary
if ((size_t)space_id >= vocabulary.size()) {
space_id = -2;
}
// init prefixes' root
PathTrie root;
root.score = root.log_prob_b_prev = 0.0;
std::vector<PathTrie *> prefixes;
prefixes.push_back(&root);
if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
auto fst_dict = static_cast<fst::StdVectorFst *>(ext_scorer->dictionary);
fst::StdVectorFst *dict_ptr = fst_dict->Copy(true);
root.set_dictionary(dict_ptr);
auto matcher = std::make_shared<FSTMATCH>(*dict_ptr, fst::MATCH_INPUT);
root.set_matcher(matcher);
}
// prefix search over time
for (size_t time_step = 0; time_step < num_time_steps; ++time_step) {
auto &prob = probs_seq[time_step];
float min_cutoff = -NUM_FLT_INF;
bool full_beam = false;
if (ext_scorer != nullptr) {
size_t num_prefixes = std::min(prefixes.size(), beam_size);
std::sort(
prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);
min_cutoff = prefixes[num_prefixes - 1]->score +
std::log(prob[blank_id]) - std::max(0.0, ext_scorer->beta);
full_beam = (num_prefixes == beam_size);
} }
std::vector<std::pair<size_t, float>> log_prob_idx = // assign blank id
get_pruned_log_probs(prob, cutoff_prob, cutoff_top_n); // size_t blank_id = vocabulary.size();
// loop over chars size_t blank_id = 0;
for (size_t index = 0; index < log_prob_idx.size(); index++) {
auto c = log_prob_idx[index].first; // assign space id
auto log_prob_c = log_prob_idx[index].second; auto it = std::find(vocabulary.begin(), vocabulary.end(), " ");
int space_id = it - vocabulary.begin();
for (size_t i = 0; i < prefixes.size() && i < beam_size; ++i) { // if no space in vocabulary
auto prefix = prefixes[i]; if ((size_t)space_id >= vocabulary.size()) {
if (full_beam && log_prob_c + prefix->score < min_cutoff) { space_id = -2;
break; }
}
// blank // init prefixes' root
if (c == blank_id) { PathTrie root;
prefix->log_prob_b_cur = root.score = root.log_prob_b_prev = 0.0;
log_sum_exp(prefix->log_prob_b_cur, log_prob_c + prefix->score); std::vector<PathTrie *> prefixes;
continue; prefixes.push_back(&root);
if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
auto fst_dict =
static_cast<fst::StdVectorFst *>(ext_scorer->dictionary);
fst::StdVectorFst *dict_ptr = fst_dict->Copy(true);
root.set_dictionary(dict_ptr);
auto matcher = std::make_shared<FSTMATCH>(*dict_ptr, fst::MATCH_INPUT);
root.set_matcher(matcher);
}
// prefix search over time
for (size_t time_step = 0; time_step < num_time_steps; ++time_step) {
auto &prob = probs_seq[time_step];
float min_cutoff = -NUM_FLT_INF;
bool full_beam = false;
if (ext_scorer != nullptr) {
size_t num_prefixes = std::min(prefixes.size(), beam_size);
std::sort(prefixes.begin(),
prefixes.begin() + num_prefixes,
prefix_compare);
min_cutoff = prefixes[num_prefixes - 1]->score +
std::log(prob[blank_id]) -
std::max(0.0, ext_scorer->beta);
full_beam = (num_prefixes == beam_size);
} }
// repeated character
if (c == prefix->character) { std::vector<std::pair<size_t, float>> log_prob_idx =
prefix->log_prob_nb_cur = log_sum_exp( get_pruned_log_probs(prob, cutoff_prob, cutoff_top_n);
prefix->log_prob_nb_cur, log_prob_c + prefix->log_prob_nb_prev); // loop over chars
for (size_t index = 0; index < log_prob_idx.size(); index++) {
auto c = log_prob_idx[index].first;
auto log_prob_c = log_prob_idx[index].second;
for (size_t i = 0; i < prefixes.size() && i < beam_size; ++i) {
auto prefix = prefixes[i];
if (full_beam && log_prob_c + prefix->score < min_cutoff) {
break;
}
// blank
if (c == blank_id) {
prefix->log_prob_b_cur = log_sum_exp(
prefix->log_prob_b_cur, log_prob_c + prefix->score);
continue;
}
// repeated character
if (c == prefix->character) {
prefix->log_prob_nb_cur =
log_sum_exp(prefix->log_prob_nb_cur,
log_prob_c + prefix->log_prob_nb_prev);
}
// get new prefix
auto prefix_new = prefix->get_path_trie(c);
if (prefix_new != nullptr) {
float log_p = -NUM_FLT_INF;
if (c == prefix->character &&
prefix->log_prob_b_prev > -NUM_FLT_INF) {
log_p = log_prob_c + prefix->log_prob_b_prev;
} else if (c != prefix->character) {
log_p = log_prob_c + prefix->score;
}
// language model scoring
if (ext_scorer != nullptr &&
(c == space_id || ext_scorer->is_character_based())) {
PathTrie *prefix_to_score = nullptr;
// skip scoring the space
if (ext_scorer->is_character_based()) {
prefix_to_score = prefix_new;
} else {
prefix_to_score = prefix;
}
float score = 0.0;
std::vector<std::string> ngram;
ngram = ext_scorer->make_ngram(prefix_to_score);
score = ext_scorer->get_log_cond_prob(ngram) *
ext_scorer->alpha;
log_p += score;
log_p += ext_scorer->beta;
}
prefix_new->log_prob_nb_cur =
log_sum_exp(prefix_new->log_prob_nb_cur, log_p);
}
} // end of loop over prefix
} // end of loop over vocabulary
prefixes.clear();
// update log probs
root.iterate_to_vec(prefixes);
// only preserve top beam_size prefixes
if (prefixes.size() >= beam_size) {
std::nth_element(prefixes.begin(),
prefixes.begin() + beam_size,
prefixes.end(),
prefix_compare);
for (size_t i = beam_size; i < prefixes.size(); ++i) {
prefixes[i]->remove();
}
} }
// get new prefix } // end of loop over time
auto prefix_new = prefix->get_path_trie(c);
// score the last word of each prefix that doesn't end with space
if (prefix_new != nullptr) { if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
float log_p = -NUM_FLT_INF; for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
auto prefix = prefixes[i];
if (c == prefix->character && if (!prefix->is_empty() && prefix->character != space_id) {
prefix->log_prob_b_prev > -NUM_FLT_INF) { float score = 0.0;
log_p = log_prob_c + prefix->log_prob_b_prev; std::vector<std::string> ngram = ext_scorer->make_ngram(prefix);
} else if (c != prefix->character) { score =
log_p = log_prob_c + prefix->score; ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha;
} score += ext_scorer->beta;
prefix->score += score;
// language model scoring
if (ext_scorer != nullptr &&
(c == space_id || ext_scorer->is_character_based())) {
PathTrie *prefix_to_score = nullptr;
// skip scoring the space
if (ext_scorer->is_character_based()) {
prefix_to_score = prefix_new;
} else {
prefix_to_score = prefix;
} }
float score = 0.0;
std::vector<std::string> ngram;
ngram = ext_scorer->make_ngram(prefix_to_score);
score = ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha;
log_p += score;
log_p += ext_scorer->beta;
}
prefix_new->log_prob_nb_cur =
log_sum_exp(prefix_new->log_prob_nb_cur, log_p);
} }
} // end of loop over prefix
} // end of loop over vocabulary
prefixes.clear();
// update log probs
root.iterate_to_vec(prefixes);
// only preserve top beam_size prefixes
if (prefixes.size() >= beam_size) {
std::nth_element(prefixes.begin(),
prefixes.begin() + beam_size,
prefixes.end(),
prefix_compare);
for (size_t i = beam_size; i < prefixes.size(); ++i) {
prefixes[i]->remove();
}
} }
} // end of loop over time
// score the last word of each prefix that doesn't end with space size_t num_prefixes = std::min(prefixes.size(), beam_size);
if (ext_scorer != nullptr && !ext_scorer->is_character_based()) { std::sort(
prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);
// compute approximate ctc score as the return score, without affecting the
// return order of decoding result. To delete when decoder gets stable.
for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) { for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
auto prefix = prefixes[i]; double approx_ctc = prefixes[i]->score;
if (!prefix->is_empty() && prefix->character != space_id) { if (ext_scorer != nullptr) {
float score = 0.0; std::vector<int> output;
std::vector<std::string> ngram = ext_scorer->make_ngram(prefix); prefixes[i]->get_path_vec(output);
score = ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha; auto prefix_length = output.size();
score += ext_scorer->beta; auto words = ext_scorer->split_labels(output);
prefix->score += score; // remove word insert
} approx_ctc = approx_ctc - prefix_length * ext_scorer->beta;
} // remove language model weight:
} approx_ctc -=
(ext_scorer->get_sent_log_prob(words)) * ext_scorer->alpha;
size_t num_prefixes = std::min(prefixes.size(), beam_size); }
std::sort(prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare); prefixes[i]->approx_ctc = approx_ctc;
// compute approximate ctc score as the return score, without affecting the
// return order of decoding result. To delete when decoder gets stable.
for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
double approx_ctc = prefixes[i]->score;
if (ext_scorer != nullptr) {
std::vector<int> output;
prefixes[i]->get_path_vec(output);
auto prefix_length = output.size();
auto words = ext_scorer->split_labels(output);
// remove word insert
approx_ctc = approx_ctc - prefix_length * ext_scorer->beta;
// remove language model weight:
approx_ctc -= (ext_scorer->get_sent_log_prob(words)) * ext_scorer->alpha;
} }
prefixes[i]->approx_ctc = approx_ctc;
}
return get_beam_search_result(prefixes, vocabulary, beam_size); return get_beam_search_result(prefixes, vocabulary, beam_size);
} }
@ -209,28 +219,28 @@ ctc_beam_search_decoder_batch(
double cutoff_prob, double cutoff_prob,
size_t cutoff_top_n, size_t cutoff_top_n,
Scorer *ext_scorer) { Scorer *ext_scorer) {
VALID_CHECK_GT(num_processes, 0, "num_processes must be positive!"); VALID_CHECK_GT(num_processes, 0, "num_processes must be positive!");
// thread pool // thread pool
ThreadPool pool(num_processes); ThreadPool pool(num_processes);
// number of samples // number of samples
size_t batch_size = probs_split.size(); size_t batch_size = probs_split.size();
// enqueue the tasks of decoding // enqueue the tasks of decoding
std::vector<std::future<std::vector<std::pair<double, std::string>>>> res; std::vector<std::future<std::vector<std::pair<double, std::string>>>> res;
for (size_t i = 0; i < batch_size; ++i) { for (size_t i = 0; i < batch_size; ++i) {
res.emplace_back(pool.enqueue(ctc_beam_search_decoder, res.emplace_back(pool.enqueue(ctc_beam_search_decoder,
probs_split[i], probs_split[i],
vocabulary, vocabulary,
beam_size, beam_size,
cutoff_prob, cutoff_prob,
cutoff_top_n, cutoff_top_n,
ext_scorer)); ext_scorer));
} }
// get decoding results // get decoding results
std::vector<std::vector<std::pair<double, std::string>>> batch_results; std::vector<std::vector<std::pair<double, std::string>>> batch_results;
for (size_t i = 0; i < batch_size; ++i) { for (size_t i = 0; i < batch_size; ++i) {
batch_results.emplace_back(res[i].get()); batch_results.emplace_back(res[i].get());
} }
return batch_results; return batch_results;
} }
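Since every utterance in ctc_beam_search_decoder_batch is decoded independently, the C++ ThreadPool above is just a parallel map. A hedged Python analog (assumed helper, not the SWIG binding):

    from multiprocessing.pool import ThreadPool

    def beam_search_batch(decode_fn, probs_split, num_processes, **kwargs):
        assert num_processes > 0, "num_processes must be positive"
        with ThreadPool(num_processes) as pool:
            # one decoding task per utterance, results kept in input order
            return pool.map(lambda probs: decode_fn(probs, **kwargs), probs_split)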

@ -18,42 +18,42 @@
std::string ctc_greedy_decoder( std::string ctc_greedy_decoder(
const std::vector<std::vector<double>> &probs_seq, const std::vector<std::vector<double>> &probs_seq,
const std::vector<std::string> &vocabulary) { const std::vector<std::string> &vocabulary) {
// dimension check // dimension check
size_t num_time_steps = probs_seq.size(); size_t num_time_steps = probs_seq.size();
for (size_t i = 0; i < num_time_steps; ++i) { for (size_t i = 0; i < num_time_steps; ++i) {
VALID_CHECK_EQ(probs_seq[i].size(), VALID_CHECK_EQ(probs_seq[i].size(),
vocabulary.size() + 1, vocabulary.size() + 1,
"The shape of probs_seq does not match with " "The shape of probs_seq does not match with "
"the shape of the vocabulary"); "the shape of the vocabulary");
} }
size_t blank_id = vocabulary.size(); size_t blank_id = vocabulary.size();
std::vector<size_t> max_idx_vec(num_time_steps, 0); std::vector<size_t> max_idx_vec(num_time_steps, 0);
std::vector<size_t> idx_vec; std::vector<size_t> idx_vec;
for (size_t i = 0; i < num_time_steps; ++i) { for (size_t i = 0; i < num_time_steps; ++i) {
double max_prob = 0.0; double max_prob = 0.0;
size_t max_idx = 0; size_t max_idx = 0;
const std::vector<double> &probs_step = probs_seq[i]; const std::vector<double> &probs_step = probs_seq[i];
for (size_t j = 0; j < probs_step.size(); ++j) { for (size_t j = 0; j < probs_step.size(); ++j) {
if (max_prob < probs_step[j]) { if (max_prob < probs_step[j]) {
max_idx = j; max_idx = j;
max_prob = probs_step[j]; max_prob = probs_step[j];
} }
} }
// id with maximum probability in current time step // id with maximum probability in current time step
max_idx_vec[i] = max_idx; max_idx_vec[i] = max_idx;
// deduplicate // deduplicate
if ((i == 0) || ((i > 0) && max_idx_vec[i] != max_idx_vec[i - 1])) { if ((i == 0) || ((i > 0) && max_idx_vec[i] != max_idx_vec[i - 1])) {
idx_vec.push_back(max_idx_vec[i]); idx_vec.push_back(max_idx_vec[i]);
}
} }
}
std::string best_path_result; std::string best_path_result;
for (size_t i = 0; i < idx_vec.size(); ++i) { for (size_t i = 0; i < idx_vec.size(); ++i) {
if (idx_vec[i] != blank_id) { if (idx_vec[i] != blank_id) {
best_path_result += vocabulary[idx_vec[i]]; best_path_result += vocabulary[idx_vec[i]];
}
} }
} return best_path_result;
return best_path_result;
} }
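The greedy decoder above is argmax per frame, collapse repeats, drop the blank. A short NumPy sketch of the same logic (assuming, like the C++ code, that blank is the last index):

    import numpy as np

    def ctc_greedy_decode(probs_seq, vocabulary):
        blank_id = len(vocabulary)
        best_path = np.argmax(np.asarray(probs_seq), axis=1)  # [T]
        out, prev = [], None
        for idx in best_path:
            if idx != prev and idx != blank_id:  # deduplicate, then skip blank
                out.append(vocabulary[idx])
            prev = idx
        return "".join(out)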

@ -22,33 +22,35 @@ std::vector<std::pair<size_t, float>> get_pruned_log_probs(
const std::vector<double> &prob_step, const std::vector<double> &prob_step,
double cutoff_prob, double cutoff_prob,
size_t cutoff_top_n) { size_t cutoff_top_n) {
std::vector<std::pair<int, double>> prob_idx; std::vector<std::pair<int, double>> prob_idx;
for (size_t i = 0; i < prob_step.size(); ++i) { for (size_t i = 0; i < prob_step.size(); ++i) {
prob_idx.push_back(std::pair<int, double>(i, prob_step[i])); prob_idx.push_back(std::pair<int, double>(i, prob_step[i]));
}
// pruning of vocabulary
size_t cutoff_len = prob_step.size();
if (cutoff_prob < 1.0 || cutoff_top_n < cutoff_len) {
std::sort(
prob_idx.begin(), prob_idx.end(), pair_comp_second_rev<int, double>);
if (cutoff_prob < 1.0) {
double cum_prob = 0.0;
cutoff_len = 0;
for (size_t i = 0; i < prob_idx.size(); ++i) {
cum_prob += prob_idx[i].second;
cutoff_len += 1;
if (cum_prob >= cutoff_prob || cutoff_len >= cutoff_top_n) break;
}
} }
prob_idx = std::vector<std::pair<int, double>>( // pruning of vocabulary
prob_idx.begin(), prob_idx.begin() + cutoff_len); size_t cutoff_len = prob_step.size();
} if (cutoff_prob < 1.0 || cutoff_top_n < cutoff_len) {
std::vector<std::pair<size_t, float>> log_prob_idx; std::sort(prob_idx.begin(),
for (size_t i = 0; i < cutoff_len; ++i) { prob_idx.end(),
log_prob_idx.push_back(std::pair<int, float>( pair_comp_second_rev<int, double>);
prob_idx[i].first, log(prob_idx[i].second + NUM_FLT_MIN))); if (cutoff_prob < 1.0) {
} double cum_prob = 0.0;
return log_prob_idx; cutoff_len = 0;
for (size_t i = 0; i < prob_idx.size(); ++i) {
cum_prob += prob_idx[i].second;
cutoff_len += 1;
if (cum_prob >= cutoff_prob || cutoff_len >= cutoff_top_n)
break;
}
}
prob_idx = std::vector<std::pair<int, double>>(
prob_idx.begin(), prob_idx.begin() + cutoff_len);
}
std::vector<std::pair<size_t, float>> log_prob_idx;
for (size_t i = 0; i < cutoff_len; ++i) {
log_prob_idx.push_back(std::pair<int, float>(
prob_idx[i].first, log(prob_idx[i].second + NUM_FLT_MIN)));
}
return log_prob_idx;
} }
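get_pruned_log_probs keeps only the most probable labels of a frame, stopping once the cumulative probability reaches cutoff_prob or cutoff_top_n labels have been kept, then moves to log space. A simplified Python sketch (it always sorts, whereas the C++ version skips sorting when no pruning is requested):

    import math

    NUM_FLT_MIN = 1.175494e-38  # roughly std::numeric_limits<float>::min()

    def pruned_log_probs(prob_step, cutoff_prob=1.0, cutoff_top_n=40):
        prob_idx = sorted(enumerate(prob_step), key=lambda kv: kv[1], reverse=True)
        cutoff_len, cum_prob = 0, 0.0
        for _, p in prob_idx:
            cum_prob += p
            cutoff_len += 1
            if cum_prob >= cutoff_prob or cutoff_len >= cutoff_top_n:
                break
        return [(i, math.log(p + NUM_FLT_MIN)) for i, p in prob_idx[:cutoff_len]]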
@ -56,106 +58,106 @@ std::vector<std::pair<double, std::string>> get_beam_search_result(
const std::vector<PathTrie *> &prefixes, const std::vector<PathTrie *> &prefixes,
const std::vector<std::string> &vocabulary, const std::vector<std::string> &vocabulary,
size_t beam_size) { size_t beam_size) {
// allow for the post processing // allow for the post processing
std::vector<PathTrie *> space_prefixes; std::vector<PathTrie *> space_prefixes;
if (space_prefixes.empty()) { if (space_prefixes.empty()) {
for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) { for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
space_prefixes.push_back(prefixes[i]); space_prefixes.push_back(prefixes[i]);
}
} }
}
std::sort(space_prefixes.begin(), space_prefixes.end(), prefix_compare);
std::sort(space_prefixes.begin(), space_prefixes.end(), prefix_compare); std::vector<std::pair<double, std::string>> output_vecs;
std::vector<std::pair<double, std::string>> output_vecs; for (size_t i = 0; i < beam_size && i < space_prefixes.size(); ++i) {
for (size_t i = 0; i < beam_size && i < space_prefixes.size(); ++i) { std::vector<int> output;
std::vector<int> output; space_prefixes[i]->get_path_vec(output);
space_prefixes[i]->get_path_vec(output); // convert index to string
// convert index to string std::string output_str;
std::string output_str; for (size_t j = 0; j < output.size(); j++) {
for (size_t j = 0; j < output.size(); j++) { output_str += vocabulary[output[j]];
output_str += vocabulary[output[j]]; }
std::pair<double, std::string> output_pair(
-space_prefixes[i]->approx_ctc, output_str);
output_vecs.emplace_back(output_pair);
} }
std::pair<double, std::string> output_pair(-space_prefixes[i]->approx_ctc,
output_str);
output_vecs.emplace_back(output_pair);
}
return output_vecs; return output_vecs;
} }
size_t get_utf8_str_len(const std::string &str) { size_t get_utf8_str_len(const std::string &str) {
size_t str_len = 0; size_t str_len = 0;
for (char c : str) { for (char c : str) {
str_len += ((c & 0xc0) != 0x80); str_len += ((c & 0xc0) != 0x80);
} }
return str_len; return str_len;
} }
std::vector<std::string> split_utf8_str(const std::string &str) { std::vector<std::string> split_utf8_str(const std::string &str) {
std::vector<std::string> result; std::vector<std::string> result;
std::string out_str; std::string out_str;
for (char c : str) { for (char c : str) {
if ((c & 0xc0) != 0x80) // new UTF-8 character if ((c & 0xc0) != 0x80) // new UTF-8 character
{ {
if (!out_str.empty()) { if (!out_str.empty()) {
result.push_back(out_str); result.push_back(out_str);
out_str.clear(); out_str.clear();
} }
}
out_str.append(1, c);
} }
result.push_back(out_str);
out_str.append(1, c); return result;
}
result.push_back(out_str);
return result;
} }
std::vector<std::string> split_str(const std::string &s, std::vector<std::string> split_str(const std::string &s,
const std::string &delim) { const std::string &delim) {
std::vector<std::string> result; std::vector<std::string> result;
std::size_t start = 0, delim_len = delim.size(); std::size_t start = 0, delim_len = delim.size();
while (true) { while (true) {
std::size_t end = s.find(delim, start); std::size_t end = s.find(delim, start);
if (end == std::string::npos) { if (end == std::string::npos) {
if (start < s.size()) { if (start < s.size()) {
result.push_back(s.substr(start)); result.push_back(s.substr(start));
} }
break; break;
} }
if (end > start) { if (end > start) {
result.push_back(s.substr(start, end - start)); result.push_back(s.substr(start, end - start));
}
start = end + delim_len;
} }
start = end + delim_len; return result;
}
return result;
} }
bool prefix_compare(const PathTrie *x, const PathTrie *y) { bool prefix_compare(const PathTrie *x, const PathTrie *y) {
if (x->score == y->score) { if (x->score == y->score) {
if (x->character == y->character) { if (x->character == y->character) {
return false; return false;
} else {
return (x->character < y->character);
}
} else { } else {
return (x->character < y->character); return x->score > y->score;
} }
} else {
return x->score > y->score;
}
} }
void add_word_to_fst(const std::vector<int> &word, void add_word_to_fst(const std::vector<int> &word,
fst::StdVectorFst *dictionary) { fst::StdVectorFst *dictionary) {
if (dictionary->NumStates() == 0) { if (dictionary->NumStates() == 0) {
fst::StdVectorFst::StateId start = dictionary->AddState(); fst::StdVectorFst::StateId start = dictionary->AddState();
assert(start == 0); assert(start == 0);
dictionary->SetStart(start); dictionary->SetStart(start);
} }
fst::StdVectorFst::StateId src = dictionary->Start(); fst::StdVectorFst::StateId src = dictionary->Start();
fst::StdVectorFst::StateId dst; fst::StdVectorFst::StateId dst;
for (auto c : word) { for (auto c : word) {
dst = dictionary->AddState(); dst = dictionary->AddState();
dictionary->AddArc(src, fst::StdArc(c, c, 0, dst)); dictionary->AddArc(src, fst::StdArc(c, c, 0, dst));
src = dst; src = dst;
} }
dictionary->SetFinal(dst, fst::StdArc::Weight::One()); dictionary->SetFinal(dst, fst::StdArc::Weight::One());
} }
bool add_word_to_dictionary( bool add_word_to_dictionary(
@ -164,27 +166,27 @@ bool add_word_to_dictionary(
bool add_space, bool add_space,
int SPACE_ID, int SPACE_ID,
fst::StdVectorFst *dictionary) { fst::StdVectorFst *dictionary) {
auto characters = split_utf8_str(word); auto characters = split_utf8_str(word);
std::vector<int> int_word; std::vector<int> int_word;
for (auto &c : characters) { for (auto &c : characters) {
if (c == " ") { if (c == " ") {
int_word.push_back(SPACE_ID); int_word.push_back(SPACE_ID);
} else { } else {
auto int_c = char_map.find(c); auto int_c = char_map.find(c);
if (int_c != char_map.end()) { if (int_c != char_map.end()) {
int_word.push_back(int_c->second); int_word.push_back(int_c->second);
} else { } else {
return false; // return without adding return false; // return without adding
} }
}
} }
}
if (add_space) { if (add_space) {
int_word.push_back(SPACE_ID); int_word.push_back(SPACE_ID);
} }
add_word_to_fst(int_word, dictionary); add_word_to_fst(int_word, dictionary);
return true; // return with successful adding return true; // return with successful adding
} }

@ -25,14 +25,14 @@ const float NUM_FLT_MIN = std::numeric_limits<float>::min();
// inline function for validation check // inline function for validation check
inline void check( inline void check(
bool x, const char *expr, const char *file, int line, const char *err) { bool x, const char *expr, const char *file, int line, const char *err) {
if (!x) { if (!x) {
std::cout << "[" << file << ":" << line << "] "; std::cout << "[" << file << ":" << line << "] ";
LOG(FATAL) << "\"" << expr << "\" check failed. " << err; LOG(FATAL) << "\"" << expr << "\" check failed. " << err;
} }
} }
#define VALID_CHECK(x, info) \ #define VALID_CHECK(x, info) \
check(static_cast<bool>(x), #x, __FILE__, __LINE__, info) check(static_cast<bool>(x), #x, __FILE__, __LINE__, info)
#define VALID_CHECK_EQ(x, y, info) VALID_CHECK((x) == (y), info) #define VALID_CHECK_EQ(x, y, info) VALID_CHECK((x) == (y), info)
#define VALID_CHECK_GT(x, y, info) VALID_CHECK((x) > (y), info) #define VALID_CHECK_GT(x, y, info) VALID_CHECK((x) > (y), info)
#define VALID_CHECK_LT(x, y, info) VALID_CHECK((x) < (y), info) #define VALID_CHECK_LT(x, y, info) VALID_CHECK((x) < (y), info)
@ -42,24 +42,24 @@ inline void check(
template <typename T1, typename T2> template <typename T1, typename T2>
bool pair_comp_first_rev(const std::pair<T1, T2> &a, bool pair_comp_first_rev(const std::pair<T1, T2> &a,
const std::pair<T1, T2> &b) { const std::pair<T1, T2> &b) {
return a.first > b.first; return a.first > b.first;
} }
// Function template for comparing two pairs // Function template for comparing two pairs
template <typename T1, typename T2> template <typename T1, typename T2>
bool pair_comp_second_rev(const std::pair<T1, T2> &a, bool pair_comp_second_rev(const std::pair<T1, T2> &a,
const std::pair<T1, T2> &b) { const std::pair<T1, T2> &b) {
return a.second > b.second; return a.second > b.second;
} }
// Return the sum of two probabilities in log scale // Return the sum of two probabilities in log scale
template <typename T> template <typename T>
T log_sum_exp(const T &x, const T &y) { T log_sum_exp(const T &x, const T &y) {
static T num_min = -std::numeric_limits<T>::max(); static T num_min = -std::numeric_limits<T>::max();
if (x <= num_min) return y; if (x <= num_min) return y;
if (y <= num_min) return x; if (y <= num_min) return x;
T xmax = std::max(x, y); T xmax = std::max(x, y);
return std::log(std::exp(x - xmax) + std::exp(y - xmax)) + xmax; return std::log(std::exp(x - xmax) + std::exp(y - xmax)) + xmax;
} }
// Get pruned probability vector for each time step's beam search // Get pruned probability vector for each time step's beam search
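log_sum_exp above adds two probabilities stored in the log domain; shifting by the maximum keeps exp() from overflowing. A quick Python check of the same identity (the -infinity guard of the template is omitted here):

    import math

    def log_sum_exp(x, y):
        m = max(x, y)
        return m + math.log(math.exp(x - m) + math.exp(y - m))

    # log_sum_exp(math.log(0.3), math.log(0.2)) ~ math.log(0.5), up to rounding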

@ -23,140 +23,141 @@
#include "decoder_utils.h" #include "decoder_utils.h"
PathTrie::PathTrie() { PathTrie::PathTrie() {
log_prob_b_prev = -NUM_FLT_INF; log_prob_b_prev = -NUM_FLT_INF;
log_prob_nb_prev = -NUM_FLT_INF; log_prob_nb_prev = -NUM_FLT_INF;
log_prob_b_cur = -NUM_FLT_INF; log_prob_b_cur = -NUM_FLT_INF;
log_prob_nb_cur = -NUM_FLT_INF; log_prob_nb_cur = -NUM_FLT_INF;
score = -NUM_FLT_INF; score = -NUM_FLT_INF;
ROOT_ = -1; ROOT_ = -1;
character = ROOT_; character = ROOT_;
exists_ = true; exists_ = true;
parent = nullptr; parent = nullptr;
dictionary_ = nullptr; dictionary_ = nullptr;
dictionary_state_ = 0; dictionary_state_ = 0;
has_dictionary_ = false; has_dictionary_ = false;
matcher_ = nullptr; matcher_ = nullptr;
} }
PathTrie::~PathTrie() { PathTrie::~PathTrie() {
for (auto child : children_) { for (auto child : children_) {
delete child.second; delete child.second;
} }
} }
PathTrie* PathTrie::get_path_trie(int new_char, bool reset) { PathTrie* PathTrie::get_path_trie(int new_char, bool reset) {
auto child = children_.begin(); auto child = children_.begin();
for (child = children_.begin(); child != children_.end(); ++child) { for (child = children_.begin(); child != children_.end(); ++child) {
if (child->first == new_char) { if (child->first == new_char) {
break; break;
} }
}
if (child != children_.end()) {
if (!child->second->exists_) {
child->second->exists_ = true;
child->second->log_prob_b_prev = -NUM_FLT_INF;
child->second->log_prob_nb_prev = -NUM_FLT_INF;
child->second->log_prob_b_cur = -NUM_FLT_INF;
child->second->log_prob_nb_cur = -NUM_FLT_INF;
} }
return (child->second); if (child != children_.end()) {
} else { if (!child->second->exists_) {
if (has_dictionary_) { child->second->exists_ = true;
matcher_->SetState(dictionary_state_); child->second->log_prob_b_prev = -NUM_FLT_INF;
bool found = matcher_->Find(new_char + 1); child->second->log_prob_nb_prev = -NUM_FLT_INF;
if (!found) { child->second->log_prob_b_cur = -NUM_FLT_INF;
// Adding this character causes word outside dictionary child->second->log_prob_nb_cur = -NUM_FLT_INF;
auto FSTZERO = fst::TropicalWeight::Zero();
auto final_weight = dictionary_->Final(dictionary_state_);
bool is_final = (final_weight != FSTZERO);
if (is_final && reset) {
dictionary_state_ = dictionary_->Start();
} }
return nullptr; return (child->second);
} else {
PathTrie* new_path = new PathTrie;
new_path->character = new_char;
new_path->parent = this;
new_path->dictionary_ = dictionary_;
new_path->dictionary_state_ = matcher_->Value().nextstate;
new_path->has_dictionary_ = true;
new_path->matcher_ = matcher_;
children_.push_back(std::make_pair(new_char, new_path));
return new_path;
}
} else { } else {
PathTrie* new_path = new PathTrie; if (has_dictionary_) {
new_path->character = new_char; matcher_->SetState(dictionary_state_);
new_path->parent = this; bool found = matcher_->Find(new_char + 1);
children_.push_back(std::make_pair(new_char, new_path)); if (!found) {
return new_path; // Adding this character causes word outside dictionary
auto FSTZERO = fst::TropicalWeight::Zero();
auto final_weight = dictionary_->Final(dictionary_state_);
bool is_final = (final_weight != FSTZERO);
if (is_final && reset) {
dictionary_state_ = dictionary_->Start();
}
return nullptr;
} else {
PathTrie* new_path = new PathTrie;
new_path->character = new_char;
new_path->parent = this;
new_path->dictionary_ = dictionary_;
new_path->dictionary_state_ = matcher_->Value().nextstate;
new_path->has_dictionary_ = true;
new_path->matcher_ = matcher_;
children_.push_back(std::make_pair(new_char, new_path));
return new_path;
}
} else {
PathTrie* new_path = new PathTrie;
new_path->character = new_char;
new_path->parent = this;
children_.push_back(std::make_pair(new_char, new_path));
return new_path;
}
} }
}
} }
PathTrie* PathTrie::get_path_vec(std::vector<int>& output) { PathTrie* PathTrie::get_path_vec(std::vector<int>& output) {
return get_path_vec(output, ROOT_); return get_path_vec(output, ROOT_);
} }
PathTrie* PathTrie::get_path_vec(std::vector<int>& output, PathTrie* PathTrie::get_path_vec(std::vector<int>& output,
int stop, int stop,
size_t max_steps) { size_t max_steps) {
if (character == stop || character == ROOT_ || output.size() == max_steps) { if (character == stop || character == ROOT_ || output.size() == max_steps) {
std::reverse(output.begin(), output.end()); std::reverse(output.begin(), output.end());
return this; return this;
} else { } else {
output.push_back(character); output.push_back(character);
return parent->get_path_vec(output, stop, max_steps); return parent->get_path_vec(output, stop, max_steps);
} }
} }
void PathTrie::iterate_to_vec(std::vector<PathTrie*>& output) { void PathTrie::iterate_to_vec(std::vector<PathTrie*>& output) {
if (exists_) { if (exists_) {
log_prob_b_prev = log_prob_b_cur; log_prob_b_prev = log_prob_b_cur;
log_prob_nb_prev = log_prob_nb_cur; log_prob_nb_prev = log_prob_nb_cur;
log_prob_b_cur = -NUM_FLT_INF; log_prob_b_cur = -NUM_FLT_INF;
log_prob_nb_cur = -NUM_FLT_INF; log_prob_nb_cur = -NUM_FLT_INF;
score = log_sum_exp(log_prob_b_prev, log_prob_nb_prev); score = log_sum_exp(log_prob_b_prev, log_prob_nb_prev);
output.push_back(this); output.push_back(this);
} }
for (auto child : children_) { for (auto child : children_) {
child.second->iterate_to_vec(output); child.second->iterate_to_vec(output);
} }
} }
void PathTrie::remove() { void PathTrie::remove() {
exists_ = false; exists_ = false;
if (children_.size() == 0) { if (children_.size() == 0) {
auto child = parent->children_.begin(); auto child = parent->children_.begin();
for (child = parent->children_.begin(); child != parent->children_.end(); for (child = parent->children_.begin();
++child) { child != parent->children_.end();
if (child->first == character) { ++child) {
parent->children_.erase(child); if (child->first == character) {
break; parent->children_.erase(child);
} break;
} }
}
if (parent->children_.size() == 0 && !parent->exists_) { if (parent->children_.size() == 0 && !parent->exists_) {
parent->remove(); parent->remove();
} }
delete this; delete this;
} }
} }
void PathTrie::set_dictionary(fst::StdVectorFst* dictionary) { void PathTrie::set_dictionary(fst::StdVectorFst* dictionary) {
dictionary_ = dictionary; dictionary_ = dictionary;
dictionary_state_ = dictionary->Start(); dictionary_state_ = dictionary->Start();
has_dictionary_ = true; has_dictionary_ = true;
} }
using FSTMATCH = fst::SortedMatcher<fst::StdVectorFst>; using FSTMATCH = fst::SortedMatcher<fst::StdVectorFst>;
void PathTrie::set_matcher(std::shared_ptr<FSTMATCH> matcher) { void PathTrie::set_matcher(std::shared_ptr<FSTMATCH> matcher) {
matcher_ = matcher; matcher_ = matcher;
} }
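PathTrie stores one label per node plus a parent pointer, so a prefix is recovered by walking back to the root and reversing, which is exactly what get_path_vec does. A toy Python sketch of that idea (field names are illustrative; the FST dictionary handling is left out):

    class PrefixNode:
        def __init__(self, character=-1, parent=None):
            self.character = character      # -1 marks the root, like ROOT_
            self.parent = parent
            self.children = {}

        def get_path_trie(self, c):
            # reuse an existing child for label c, or create one
            return self.children.setdefault(c, PrefixNode(c, self))

        def get_path_vec(self):
            out, node = [], self
            while node.parent is not None:
                out.append(node.character)
                node = node.parent
            return list(reversed(out))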

@ -27,55 +27,56 @@
* finite-state transducer for spelling correction. * finite-state transducer for spelling correction.
*/ */
class PathTrie { class PathTrie {
public: public:
PathTrie(); PathTrie();
~PathTrie(); ~PathTrie();
// get new prefix after appending new char // get new prefix after appending new char
PathTrie* get_path_trie(int new_char, bool reset = true); PathTrie* get_path_trie(int new_char, bool reset = true);
// get the prefix in index from root to current node // get the prefix in index from root to current node
PathTrie* get_path_vec(std::vector<int>& output); PathTrie* get_path_vec(std::vector<int>& output);
// get the prefix in index from some stop node to current node // get the prefix in index from some stop node to current node
PathTrie* get_path_vec(std::vector<int>& output, PathTrie* get_path_vec(
int stop, std::vector<int>& output,
size_t max_steps = std::numeric_limits<size_t>::max()); int stop,
size_t max_steps = std::numeric_limits<size_t>::max());
// update log probs // update log probs
void iterate_to_vec(std::vector<PathTrie*>& output); void iterate_to_vec(std::vector<PathTrie*>& output);
// set dictionary for FST // set dictionary for FST
void set_dictionary(fst::StdVectorFst* dictionary); void set_dictionary(fst::StdVectorFst* dictionary);
void set_matcher(std::shared_ptr<fst::SortedMatcher<fst::StdVectorFst>>); void set_matcher(std::shared_ptr<fst::SortedMatcher<fst::StdVectorFst>>);
bool is_empty() { return ROOT_ == character; } bool is_empty() { return ROOT_ == character; }
// remove current path from root // remove current path from root
void remove(); void remove();
float log_prob_b_prev; float log_prob_b_prev;
float log_prob_nb_prev; float log_prob_nb_prev;
float log_prob_b_cur; float log_prob_b_cur;
float log_prob_nb_cur; float log_prob_nb_cur;
float score; float score;
float approx_ctc; float approx_ctc;
int character; int character;
PathTrie* parent; PathTrie* parent;
private: private:
int ROOT_; int ROOT_;
bool exists_; bool exists_;
bool has_dictionary_; bool has_dictionary_;
std::vector<std::pair<int, PathTrie*>> children_; std::vector<std::pair<int, PathTrie*>> children_;
// pointer to dictionary of FST // pointer to dictionary of FST
fst::StdVectorFst* dictionary_; fst::StdVectorFst* dictionary_;
fst::StdVectorFst::StateId dictionary_state_; fst::StdVectorFst::StateId dictionary_state_;
// true if finding arcs in FST // true if finding arcs in FST
std::shared_ptr<fst::SortedMatcher<fst::StdVectorFst>> matcher_; std::shared_ptr<fst::SortedMatcher<fst::StdVectorFst>> matcher_;
}; };
#endif // PATH_TRIE_H #endif // PATH_TRIE_H

@ -31,214 +31,214 @@ Scorer::Scorer(double alpha,
double beta, double beta,
const std::string& lm_path, const std::string& lm_path,
const std::vector<std::string>& vocab_list) { const std::vector<std::string>& vocab_list) {
this->alpha = alpha; this->alpha = alpha;
this->beta = beta; this->beta = beta;
dictionary = nullptr; dictionary = nullptr;
is_character_based_ = true; is_character_based_ = true;
language_model_ = nullptr; language_model_ = nullptr;
max_order_ = 0; max_order_ = 0;
dict_size_ = 0; dict_size_ = 0;
SPACE_ID_ = -1; SPACE_ID_ = -1;
setup(lm_path, vocab_list); setup(lm_path, vocab_list);
} }
Scorer::~Scorer() { Scorer::~Scorer() {
if (language_model_ != nullptr) { if (language_model_ != nullptr) {
delete static_cast<lm::base::Model*>(language_model_); delete static_cast<lm::base::Model*>(language_model_);
} }
if (dictionary != nullptr) { if (dictionary != nullptr) {
delete static_cast<fst::StdVectorFst*>(dictionary); delete static_cast<fst::StdVectorFst*>(dictionary);
} }
} }
void Scorer::setup(const std::string& lm_path, void Scorer::setup(const std::string& lm_path,
const std::vector<std::string>& vocab_list) { const std::vector<std::string>& vocab_list) {
// load language model // load language model
load_lm(lm_path); load_lm(lm_path);
// set char map for scorer // set char map for scorer
set_char_map(vocab_list); set_char_map(vocab_list);
// fill the dictionary for FST // fill the dictionary for FST
if (!is_character_based()) { if (!is_character_based()) {
fill_dictionary(true); fill_dictionary(true);
} }
} }
void Scorer::load_lm(const std::string& lm_path) { void Scorer::load_lm(const std::string& lm_path) {
const char* filename = lm_path.c_str(); const char* filename = lm_path.c_str();
VALID_CHECK_EQ(access(filename, F_OK), 0, "Invalid language model path"); VALID_CHECK_EQ(access(filename, F_OK), 0, "Invalid language model path");
RetriveStrEnumerateVocab enumerate; RetriveStrEnumerateVocab enumerate;
lm::ngram::Config config; lm::ngram::Config config;
config.enumerate_vocab = &enumerate; config.enumerate_vocab = &enumerate;
language_model_ = lm::ngram::LoadVirtual(filename, config); language_model_ = lm::ngram::LoadVirtual(filename, config);
max_order_ = static_cast<lm::base::Model*>(language_model_)->Order(); max_order_ = static_cast<lm::base::Model*>(language_model_)->Order();
vocabulary_ = enumerate.vocabulary; vocabulary_ = enumerate.vocabulary;
for (size_t i = 0; i < vocabulary_.size(); ++i) { for (size_t i = 0; i < vocabulary_.size(); ++i) {
if (is_character_based_ && vocabulary_[i] != UNK_TOKEN && if (is_character_based_ && vocabulary_[i] != UNK_TOKEN &&
vocabulary_[i] != START_TOKEN && vocabulary_[i] != END_TOKEN && vocabulary_[i] != START_TOKEN && vocabulary_[i] != END_TOKEN &&
get_utf8_str_len(enumerate.vocabulary[i]) > 1) { get_utf8_str_len(enumerate.vocabulary[i]) > 1) {
is_character_based_ = false; is_character_based_ = false;
}
} }
}
} }
double Scorer::get_log_cond_prob(const std::vector<std::string>& words) { double Scorer::get_log_cond_prob(const std::vector<std::string>& words) {
lm::base::Model* model = static_cast<lm::base::Model*>(language_model_); lm::base::Model* model = static_cast<lm::base::Model*>(language_model_);
double cond_prob; double cond_prob;
lm::ngram::State state, tmp_state, out_state; lm::ngram::State state, tmp_state, out_state;
// avoid inserting <s> at the beginning // avoid inserting <s> at the beginning
model->NullContextWrite(&state); model->NullContextWrite(&state);
for (size_t i = 0; i < words.size(); ++i) { for (size_t i = 0; i < words.size(); ++i) {
lm::WordIndex word_index = model->BaseVocabulary().Index(words[i]); lm::WordIndex word_index = model->BaseVocabulary().Index(words[i]);
// encounter OOV // encounter OOV
if (word_index == 0) { if (word_index == 0) {
return OOV_SCORE; return OOV_SCORE;
}
cond_prob = model->BaseScore(&state, word_index, &out_state);
tmp_state = state;
state = out_state;
out_state = tmp_state;
} }
cond_prob = model->BaseScore(&state, word_index, &out_state); // return log10 prob
tmp_state = state; return cond_prob;
state = out_state;
out_state = tmp_state;
}
// return log10 prob
return cond_prob;
} }
double Scorer::get_sent_log_prob(const std::vector<std::string>& words) { double Scorer::get_sent_log_prob(const std::vector<std::string>& words) {
std::vector<std::string> sentence; std::vector<std::string> sentence;
if (words.size() == 0) { if (words.size() == 0) {
for (size_t i = 0; i < max_order_; ++i) { for (size_t i = 0; i < max_order_; ++i) {
sentence.push_back(START_TOKEN); sentence.push_back(START_TOKEN);
} }
} else { } else {
for (size_t i = 0; i < max_order_ - 1; ++i) { for (size_t i = 0; i < max_order_ - 1; ++i) {
sentence.push_back(START_TOKEN); sentence.push_back(START_TOKEN);
}
sentence.insert(sentence.end(), words.begin(), words.end());
} }
sentence.insert(sentence.end(), words.begin(), words.end()); sentence.push_back(END_TOKEN);
} return get_log_prob(sentence);
sentence.push_back(END_TOKEN);
return get_log_prob(sentence);
} }
double Scorer::get_log_prob(const std::vector<std::string>& words) { double Scorer::get_log_prob(const std::vector<std::string>& words) {
assert(words.size() > max_order_); assert(words.size() > max_order_);
double score = 0.0; double score = 0.0;
for (size_t i = 0; i < words.size() - max_order_ + 1; ++i) { for (size_t i = 0; i < words.size() - max_order_ + 1; ++i) {
std::vector<std::string> ngram(words.begin() + i, std::vector<std::string> ngram(words.begin() + i,
words.begin() + i + max_order_); words.begin() + i + max_order_);
score += get_log_cond_prob(ngram); score += get_log_cond_prob(ngram);
} }
return score; return score;
} }
void Scorer::reset_params(float alpha, float beta) { void Scorer::reset_params(float alpha, float beta) {
this->alpha = alpha; this->alpha = alpha;
this->beta = beta; this->beta = beta;
} }
std::string Scorer::vec2str(const std::vector<int>& input) { std::string Scorer::vec2str(const std::vector<int>& input) {
std::string word; std::string word;
for (auto ind : input) { for (auto ind : input) {
word += char_list_[ind]; word += char_list_[ind];
} }
return word; return word;
} }
std::vector<std::string> Scorer::split_labels(const std::vector<int>& labels) { std::vector<std::string> Scorer::split_labels(const std::vector<int>& labels) {
if (labels.empty()) return {}; if (labels.empty()) return {};
std::string s = vec2str(labels); std::string s = vec2str(labels);
std::vector<std::string> words; std::vector<std::string> words;
if (is_character_based_) { if (is_character_based_) {
words = split_utf8_str(s); words = split_utf8_str(s);
} else { } else {
words = split_str(s, " "); words = split_str(s, " ");
} }
return words; return words;
} }
void Scorer::set_char_map(const std::vector<std::string>& char_list) { void Scorer::set_char_map(const std::vector<std::string>& char_list) {
char_list_ = char_list; char_list_ = char_list;
char_map_.clear(); char_map_.clear();
// Set the char map for the FST for spelling correction // Set the char map for the FST for spelling correction
for (size_t i = 0; i < char_list_.size(); i++) { for (size_t i = 0; i < char_list_.size(); i++) {
if (char_list_[i] == " ") { if (char_list_[i] == " ") {
SPACE_ID_ = i; SPACE_ID_ = i;
}
// The initial state of FST is state 0, hence the index of chars in
// the FST should start from 1 to avoid the conflict with the initial
// state, otherwise wrong decoding results would be given.
char_map_[char_list_[i]] = i + 1;
} }
// The initial state of FST is state 0, hence the index of chars in
// the FST should start from 1 to avoid the conflict with the initial
// state, otherwise wrong decoding results would be given.
char_map_[char_list_[i]] = i + 1;
}
} }
std::vector<std::string> Scorer::make_ngram(PathTrie* prefix) { std::vector<std::string> Scorer::make_ngram(PathTrie* prefix) {
std::vector<std::string> ngram; std::vector<std::string> ngram;
PathTrie* current_node = prefix; PathTrie* current_node = prefix;
PathTrie* new_node = nullptr; PathTrie* new_node = nullptr;
for (int order = 0; order < max_order_; order++) { for (int order = 0; order < max_order_; order++) {
std::vector<int> prefix_vec; std::vector<int> prefix_vec;
if (is_character_based_) { if (is_character_based_) {
new_node = current_node->get_path_vec(prefix_vec, SPACE_ID_, 1); new_node = current_node->get_path_vec(prefix_vec, SPACE_ID_, 1);
current_node = new_node; current_node = new_node;
} else { } else {
new_node = current_node->get_path_vec(prefix_vec, SPACE_ID_); new_node = current_node->get_path_vec(prefix_vec, SPACE_ID_);
current_node = new_node->parent; // Skipping spaces current_node = new_node->parent; // Skipping spaces
}
// reconstruct word
std::string word = vec2str(prefix_vec);
ngram.push_back(word);
if (new_node->character == -1) {
// No more spaces, but still need order
for (int i = 0; i < max_order_ - order - 1; i++) {
ngram.push_back(START_TOKEN);
}
break;
}
} }
std::reverse(ngram.begin(), ngram.end());
// reconstruct word return ngram;
std::string word = vec2str(prefix_vec);
ngram.push_back(word);
if (new_node->character == -1) {
// No more spaces, but still need order
for (int i = 0; i < max_order_ - order - 1; i++) {
ngram.push_back(START_TOKEN);
}
break;
}
}
std::reverse(ngram.begin(), ngram.end());
return ngram;
} }
void Scorer::fill_dictionary(bool add_space) { void Scorer::fill_dictionary(bool add_space) {
fst::StdVectorFst dictionary; fst::StdVectorFst dictionary;
// For each unigram convert to ints and put in trie // For each unigram convert to ints and put in trie
int dict_size = 0; int dict_size = 0;
for (const auto& word : vocabulary_) { for (const auto& word : vocabulary_) {
bool added = add_word_to_dictionary( bool added = add_word_to_dictionary(
word, char_map_, add_space, SPACE_ID_ + 1, &dictionary); word, char_map_, add_space, SPACE_ID_ + 1, &dictionary);
dict_size += added ? 1 : 0; dict_size += added ? 1 : 0;
} }
dict_size_ = dict_size; dict_size_ = dict_size;
/* Simplify FST /* Simplify FST
* This gets rid of "epsilon" transitions in the FST. * This gets rid of "epsilon" transitions in the FST.
* These are transitions that don't require a string input to be taken. * These are transitions that don't require a string input to be taken.
* Getting rid of them is necessary to make the FST deterministic, but * Getting rid of them is necessary to make the FST deterministic, but
* can greatly increase the size of the FST * can greatly increase the size of the FST
*/ */
fst::RmEpsilon(&dictionary); fst::RmEpsilon(&dictionary);
fst::StdVectorFst* new_dict = new fst::StdVectorFst; fst::StdVectorFst* new_dict = new fst::StdVectorFst;
/* This makes the FST deterministic, meaning for any string input there's /* This makes the FST deterministic, meaning for any string input there's
* only one possible state the FST could be in. It is assumed our * only one possible state the FST could be in. It is assumed our
* dictionary is deterministic when using it. * dictionary is deterministic when using it.
* (lest we'd have to check for multiple transitions at each state) * (lest we'd have to check for multiple transitions at each state)
*/ */
fst::Determinize(dictionary, new_dict); fst::Determinize(dictionary, new_dict);
/* Finds the simplest equivalent fst. This is unnecessary but decreases /* Finds the simplest equivalent fst. This is unnecessary but decreases
* memory usage of the dictionary * memory usage of the dictionary
*/ */
fst::Minimize(new_dict); fst::Minimize(new_dict);
this->dictionary = new_dict; this->dictionary = new_dict;
} }
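For reference, the alpha/beta weighting that the C++ Scorer applies can be reproduced with the kenlm Python package already used elsewhere in this repo; the sketch below is an assumption about the combination (kenlm scores are log10, and beta is applied per word), not the repo's binding:

    import kenlm

    class SimpleScorer:
        def __init__(self, alpha, beta, lm_path):
            self.alpha, self.beta = alpha, beta
            self.lm = kenlm.Model(lm_path)

        def sentence_score(self, sentence):
            lm_log10 = self.lm.score(sentence, bos=True, eos=True)  # log10 prob
            word_cnt = len(sentence.split())
            return self.alpha * lm_log10 + self.beta * word_cnt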

@ -34,14 +34,14 @@ const std::string END_TOKEN = "</s>";
// Implement a callback to retrieve the dictionary of the language model. // Implement a callback to retrieve the dictionary of the language model.
class RetriveStrEnumerateVocab : public lm::EnumerateVocab { class RetriveStrEnumerateVocab : public lm::EnumerateVocab {
public: public:
RetriveStrEnumerateVocab() {} RetriveStrEnumerateVocab() {}
void Add(lm::WordIndex index, const StringPiece &str) { void Add(lm::WordIndex index, const StringPiece &str) {
vocabulary.push_back(std::string(str.data(), str.length())); vocabulary.push_back(std::string(str.data(), str.length()));
} }
std::vector<std::string> vocabulary; std::vector<std::string> vocabulary;
}; };
/* External scorer to query score for n-gram or sentence, including language /* External scorer to query score for n-gram or sentence, including language
@ -53,74 +53,74 @@ public:
* scorer.get_sent_log_prob({ "WORD1", "WORD2", "WORD3" }); * scorer.get_sent_log_prob({ "WORD1", "WORD2", "WORD3" });
*/ */
class Scorer { class Scorer {
public: public:
Scorer(double alpha, Scorer(double alpha,
double beta, double beta,
const std::string &lm_path, const std::string &lm_path,
const std::vector<std::string> &vocabulary); const std::vector<std::string> &vocabulary);
~Scorer(); ~Scorer();
double get_log_cond_prob(const std::vector<std::string> &words); double get_log_cond_prob(const std::vector<std::string> &words);
double get_sent_log_prob(const std::vector<std::string> &words); double get_sent_log_prob(const std::vector<std::string> &words);
// return the max order // return the max order
size_t get_max_order() const { return max_order_; } size_t get_max_order() const { return max_order_; }
// return the dictionary size of language model // return the dictionary size of language model
size_t get_dict_size() const { return dict_size_; } size_t get_dict_size() const { return dict_size_; }
// return true if the language model is character based // return true if the language model is character based
bool is_character_based() const { return is_character_based_; } bool is_character_based() const { return is_character_based_; }
// reset params alpha & beta // reset params alpha & beta
void reset_params(float alpha, float beta); void reset_params(float alpha, float beta);
// make ngram for a given prefix // make ngram for a given prefix
std::vector<std::string> make_ngram(PathTrie *prefix); std::vector<std::string> make_ngram(PathTrie *prefix);
// transform the labels in index to the vector of words (word based lm) or // transform the labels in index to the vector of words (word based lm) or
// the vector of characters (character based lm) // the vector of characters (character based lm)
std::vector<std::string> split_labels(const std::vector<int> &labels); std::vector<std::string> split_labels(const std::vector<int> &labels);
// language model weight // language model weight
double alpha; double alpha;
// word insertion weight // word insertion weight
double beta; double beta;
// pointer to the dictionary of FST // pointer to the dictionary of FST
void *dictionary; void *dictionary;
protected: protected:
// necessary setup: load language model, set char map, fill FST's dictionary // necessary setup: load language model, set char map, fill FST's dictionary
void setup(const std::string &lm_path, void setup(const std::string &lm_path,
const std::vector<std::string> &vocab_list); const std::vector<std::string> &vocab_list);
// load language model from given path // load language model from given path
void load_lm(const std::string &lm_path); void load_lm(const std::string &lm_path);
// fill dictionary for FST // fill dictionary for FST
void fill_dictionary(bool add_space); void fill_dictionary(bool add_space);
// set char map // set char map
void set_char_map(const std::vector<std::string> &char_list); void set_char_map(const std::vector<std::string> &char_list);
double get_log_prob(const std::vector<std::string> &words); double get_log_prob(const std::vector<std::string> &words);
// translate the vector in index to string // translate the vector in index to string
std::string vec2str(const std::vector<int> &input); std::string vec2str(const std::vector<int> &input);
private: private:
void *language_model_; void *language_model_;
bool is_character_based_; bool is_character_based_;
size_t max_order_; size_t max_order_;
size_t dict_size_; size_t dict_size_;
int SPACE_ID_; int SPACE_ID_;
std::vector<std::string> char_list_; std::vector<std::string> char_list_;
std::unordered_map<std::string, int> char_map_; std::unordered_map<std::string, int> char_map_;
std::vector<std::string> vocabulary_; std::vector<std::string> vocabulary_;
}; };
#endif // SCORER_H_ #endif // SCORER_H_

@ -12,13 +12,16 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Script to build and install decoder package.""" """Script to build and install decoder package."""
import argparse
from setuptools import setup, Extension, distutils
import glob import glob
import platform
import os, sys
import multiprocessing.pool import multiprocessing.pool
import argparse import os
import platform
import sys
from setuptools import distutils
from setuptools import Extension
from setuptools import setup
parser = argparse.ArgumentParser(description=__doc__) parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument( parser.add_argument(
@ -65,9 +68,9 @@ def parallelCCompile(self,
def compile_test(header, library): def compile_test(header, library):
dummy_path = os.path.join(os.path.dirname(__file__), "dummy") dummy_path = os.path.join(os.path.dirname(__file__), "dummy")
command = "bash -c \"g++ -include " + header \ command = "bash -c \"g++ -include " + header \
+ " -l" + library + " -x c++ - <<<'int main() {}' -o " \ + " -l" + library + " -x c++ - <<<'int main() {}' -o " \
+ dummy_path + " >/dev/null 2>/dev/null && rm " \ + dummy_path + " >/dev/null 2>/dev/null && rm " \
+ dummy_path + " 2>/dev/null\"" + dummy_path + " 2>/dev/null\""
return os.system(command) == 0 return os.system(command) == 0
@ -75,8 +78,8 @@ def compile_test(header, library):
distutils.ccompiler.CCompiler.compile = parallelCCompile distutils.ccompiler.CCompiler.compile = parallelCCompile
FILES = glob.glob('kenlm/util/*.cc') \ FILES = glob.glob('kenlm/util/*.cc') \
+ glob.glob('kenlm/lm/*.cc') \ + glob.glob('kenlm/lm/*.cc') \
+ glob.glob('kenlm/util/double-conversion/*.cc') + glob.glob('kenlm/util/double-conversion/*.cc')
FILES += glob.glob('openfst-1.6.3/src/lib/*.cc') FILES += glob.glob('openfst-1.6.3/src/lib/*.cc')
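The setup script patches distutils so that the source files above compile in parallel. A hedged sketch of the underlying idea (the real parallelCCompile keeps the full distutils _compile signature; this only shows the shape of the trick):

    import multiprocessing.pool

    def compile_in_parallel(compile_one, objects, jobs=8):
        # distutils compiles serially by default; mapping the per-object
        # compile step over a thread pool parallelizes the build
        with multiprocessing.pool.ThreadPool(jobs) as pool:
            pool.map(compile_one, objects)
        return objects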

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Wrapper for various CTC decoders in SWIG.""" """Wrapper for various CTC decoders in SWIG."""
import swig_decoders import swig_decoders

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Test decoders.""" """Test decoders."""
import unittest import unittest
from deepspeech.decoders import decoders_deprecated as decoder from deepspeech.decoders import decoders_deprecated as decoder

@ -12,11 +12,10 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Client-end for the ASR demo.""" """Client-end for the ASR demo."""
import keyboard
import struct
import socket
import sys
import argparse import argparse
import sys
import keyboard
import pyaudio import pyaudio
from deepspeech.utils.socket_server import socket_send from deepspeech.utils.socket_server import socket_send
@ -49,7 +48,7 @@ def on_press_release(x):
sys.stdout.flush() sys.stdout.flush()
is_recording = True is_recording = True
if x.event_type == 'up' and x.name == release.name: if x.event_type == 'up' and x.name == release.name:
if is_recording == True: if is_recording:
is_recording = False is_recording = False

@ -13,9 +13,10 @@
# limitations under the License. # limitations under the License.
"""Record wav from Microphone""" """Record wav from Microphone"""
# http://people.csail.mit.edu/hubert/pyaudio/ # http://people.csail.mit.edu/hubert/pyaudio/
import pyaudio
import wave import wave
import pyaudio
CHUNK = 1024 CHUNK = 1024
FORMAT = pyaudio.paInt16 FORMAT = pyaudio.paInt16
CHANNELS = 1 CHANNELS = 1
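With the constants above, the recording loop is a straightforward pyaudio read-and-write; a hedged sketch follows (the 16 kHz rate and the output file name are assumptions, not values from the script):

    import wave
    import pyaudio

    CHUNK, FORMAT, CHANNELS, RATE = 1024, pyaudio.paInt16, 1, 16000

    def record_wav(seconds, path="demo.wav"):
        pa = pyaudio.PyAudio()
        stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                         input=True, frames_per_buffer=CHUNK)
        frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * seconds))]
        stream.stop_stream()
        stream.close()
        sample_width = pa.get_sample_size(FORMAT)
        pa.terminate()
        with wave.open(path, "wb") as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(sample_width)
            wf.setframerate(RATE)
            wf.writeframes(b"".join(frames))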

@ -12,28 +12,22 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Server-end for the ASR demo.""" """Server-end for the ASR demo."""
import os
import time
import argparse
import functools import functools
import paddle
import numpy as np
from deepspeech.utils.socket_server import warm_up_test import numpy as np
from deepspeech.utils.socket_server import AsrTCPServer import paddle
from deepspeech.utils.socket_server import AsrRequestHandler from paddle.inference import Config
from paddle.inference import create_predictor
from deepspeech.training.cli import default_argument_parser
from deepspeech.exps.deepspeech2.config import get_cfg_defaults from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.frontend.utility import read_manifest
from deepspeech.utils.utility import add_arguments, print_arguments
from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.io.dataset import ManifestDataset from deepspeech.io.dataset import ManifestDataset
from deepspeech.models.deepspeech2 import DeepSpeech2Model
from paddle.inference import Config from deepspeech.training.cli import default_argument_parser
from paddle.inference import create_predictor from deepspeech.utils.socket_server import AsrRequestHandler
from deepspeech.utils.socket_server import AsrTCPServer
from deepspeech.utils.socket_server import warm_up_test
from deepspeech.utils.utility import add_arguments
from deepspeech.utils.utility import print_arguments
def init_predictor(args): def init_predictor(args):
@ -83,23 +77,11 @@ def inference(config, args):
def start_server(config, args): def start_server(config, args):
"""Start the ASR server""" """Start the ASR server"""
dataset = ManifestDataset( config.defrost()
config.data.test_manifest, config.data.manfiest = config.data.test_manifest
config.data.vocab_filepath, config.data.augmentation_config = ""
config.data.mean_std_filepath, config.data.keep_transcription_text = True
augmentation_config="{}", dataset = ManifestDataset.from_config(config)
max_duration=config.data.max_duration,
min_duration=config.data.min_duration,
stride_ms=config.data.stride_ms,
window_ms=config.data.window_ms,
n_fft=config.data.n_fft,
max_freq=config.data.max_freq,
target_sample_rate=config.data.target_sample_rate,
specgram_type=config.data.specgram_type,
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=True)
model = DeepSpeech2Model.from_pretrained(dataset, config, model = DeepSpeech2Model.from_pretrained(dataset, config,
args.checkpoint_path) args.checkpoint_path)
@ -171,22 +153,20 @@ if __name__ == "__main__":
"--params_file", "--params_file",
type=str, type=str,
default="", default="",
help= help="Parameter filename, Specify this when your model is a combined model."
"Parameter filename, Specify this when your model is a combined model."
) )
add_arg( add_arg(
"--model_dir", "--model_dir",
type=str, type=str,
default=None, default=None,
help= help="Model dir, If you load a non-combined model, specify the directory of the model."
"Model dir, If you load a non-combined model, specify the directory of the model."
) )
add_arg("--use_gpu", add_arg("--use_gpu",
type=bool, type=bool,
default=False, default=False,
help="Whether use gpu.") help="Whether use gpu.")
args = parser.parse_args() args = parser.parse_args()
print_arguments(args) print_arguments(args, globals())
# https://yaml.org/type/float.html # https://yaml.org/type/float.html
config = get_cfg_defaults() config = get_cfg_defaults()
@ -198,7 +178,7 @@ if __name__ == "__main__":
print(config) print(config)
args.warmup_manifest = config.data.test_manifest args.warmup_manifest = config.data.test_manifest
print_arguments(args) print_arguments(args, globals())
if args.dump_config: if args.dump_config:
with open(args.dump_config, 'w') as f: with open(args.dump_config, 'w') as f:

@ -12,8 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Socket client to send wav to ASR server."""
-import struct
-import socket
import argparse
import wave

@ -12,46 +12,30 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Server-end for the ASR demo."""
-import os
-import time
-import argparse
import functools
-import paddle
-import numpy as np
-from deepspeech.utils.socket_server import warm_up_test
-from deepspeech.utils.socket_server import AsrTCPServer
-from deepspeech.utils.socket_server import AsrRequestHandler
-from deepspeech.training.cli import default_argument_parser
+import numpy as np
+import paddle
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
-from deepspeech.frontend.utility import read_manifest
-from deepspeech.utils.utility import add_arguments, print_arguments
-from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.io.dataset import ManifestDataset
+from deepspeech.models.deepspeech2 import DeepSpeech2Model
+from deepspeech.training.cli import default_argument_parser
+from deepspeech.utils.socket_server import AsrRequestHandler
+from deepspeech.utils.socket_server import AsrTCPServer
+from deepspeech.utils.socket_server import warm_up_test
+from deepspeech.utils.utility import add_arguments
+from deepspeech.utils.utility import print_arguments

def start_server(config, args):
"""Start the ASR server"""
-dataset = ManifestDataset(
-config.data.test_manifest,
-config.data.vocab_filepath,
-config.data.mean_std_filepath,
-augmentation_config="{}",
-max_duration=config.data.max_duration,
-min_duration=config.data.min_duration,
-stride_ms=config.data.stride_ms,
-window_ms=config.data.window_ms,
-n_fft=config.data.n_fft,
-max_freq=config.data.max_freq,
-target_sample_rate=config.data.target_sample_rate,
-specgram_type=config.data.specgram_type,
-use_dB_normalization=config.data.use_dB_normalization,
-target_dB=config.data.target_dB,
-random_seed=config.data.random_seed,
-keep_transcription_text=True)
+config.defrost()
+config.data.manfiest = config.data.test_manifest
+config.data.augmentation_config = ""
+config.data.keep_transcription_text = True
+dataset = ManifestDataset.from_config(config)
model = DeepSpeech2Model.from_pretrained(dataset, config,
args.checkpoint_path)
model.eval()

@ -111,9 +95,9 @@ if __name__ == "__main__":
add_arg('speech_save_dir', str,
'demo_cache',
"Directory to save demo audios.")
add_arg('warmup_manifest', str, None, "Filepath of manifest to warm up.")
args = parser.parse_args()
-print_arguments(args)
+print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -125,7 +109,7 @@ if __name__ == "__main__":
print(config)
args.warmup_manifest = config.data.test_manifest
-print_arguments(args)
+print_arguments(args, globals())
if args.dump_config:
with open(args.dump_config, 'w') as f:

@ -12,20 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Export for DeepSpeech2 model."""
-import io
-import logging
-import argparse
-import functools
-from paddle import distributed as dist
-from deepspeech.training.cli import default_argument_parser
-from deepspeech.utils.utility import print_arguments
-from deepspeech.utils.error_rate import char_errors, word_errors
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
+from deepspeech.training.cli import default_argument_parser
+from deepspeech.utils.utility import print_arguments

def main_sp(config, args):

@ -12,20 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evaluation for DeepSpeech2 model."""
-import io
-import logging
-import argparse
-import functools
-from paddle import distributed as dist
-from deepspeech.training.cli import default_argument_parser
-from deepspeech.utils.utility import print_arguments
-from deepspeech.utils.error_rate import char_errors, word_errors
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
+from deepspeech.training.cli import default_argument_parser
+from deepspeech.utils.utility import print_arguments

def main_sp(config, args):

@ -41,7 +31,7 @@ def main(config, args):

if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
-print_arguments(args)
+print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -12,19 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Trainer for DeepSpeech2 model."""
-import io
-import logging
-import argparse
-import functools
from paddle import distributed as dist
-from deepspeech.utils.utility import print_arguments
-from deepspeech.training.cli import default_argument_parser
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.exps.deepspeech2.model import DeepSpeech2Trainer as Trainer
+from deepspeech.training.cli import default_argument_parser
+from deepspeech.utils.utility import print_arguments

def main_sp(config, args):

@ -43,7 +36,7 @@ def main(config, args):

if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
-print_arguments(args)
+print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -12,26 +12,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Beam search parameters tuning for DeepSpeech2 model."""
-import sys
-import os
-import numpy as np
-import argparse
import functools
-import gzip
-import logging
+import sys
+import numpy as np
from paddle.io import DataLoader
-from deepspeech.utils import error_rate
-from deepspeech.utils.utility import add_arguments, print_arguments
-from deepspeech.models.deepspeech2 import DeepSpeech2Model
+from deepspeech.exps.deepspeech2.config import get_cfg_defaults
from deepspeech.io.collator import SpeechCollator
from deepspeech.io.dataset import ManifestDataset
+from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.training.cli import default_argument_parser
-from deepspeech.exps.deepspeech2.config import get_cfg_defaults
+from deepspeech.utils import error_rate
+from deepspeech.utils.utility import add_arguments
+from deepspeech.utils.utility import print_arguments

def tune(config, args):

@ -40,31 +34,18 @@ def tune(config, args):
raise ValueError("num_alphas must be non-negative!")
if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!")
-dev_dataset = ManifestDataset(
-config.data.dev_manifest,
-config.data.vocab_filepath,
-config.data.mean_std_filepath,
-augmentation_config="{}",
-max_duration=config.data.max_duration,
-min_duration=config.data.min_duration,
-stride_ms=config.data.stride_ms,
-window_ms=config.data.window_ms,
-n_fft=config.data.n_fft,
-max_freq=config.data.max_freq,
-target_sample_rate=config.data.target_sample_rate,
-specgram_type=config.data.specgram_type,
-use_dB_normalization=config.data.use_dB_normalization,
-target_dB=config.data.target_dB,
-random_seed=config.data.random_seed,
-keep_transcription_text=True)
+config.defrost()
+config.data.manfiest = config.data.dev_manifest
+config.data.augmentation_config = ""
+config.data.keep_transcription_text = True
+dev_dataset = ManifestDataset.from_config(config)
valid_loader = DataLoader(
dev_dataset,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
-collate_fn=SpeechCollator(is_training=False))
+collate_fn=SpeechCollator(keep_transcription_text=True))
model = DeepSpeech2Model.from_pretrained(dev_dataset, config,
args.checkpoint_path)

@ -103,13 +84,13 @@ def tune(config, args):
trans.append(''.join([chr(i) for i in ids]))
return trans

-audio, text, audio_len, text_len = infer_data
+audio, audio_len, text, text_len = infer_data
target_transcripts = ordid2token(text, text_len)
num_ins += audio.shape[0]
# model infer
eouts, eouts_len = model.encoder(audio, audio_len)
-probs = model.decoder.probs(eouts)
+probs = model.decoder.softmax(eouts)
# grid search
for index, (alpha, beta) in enumerate(params_grid):

@ -134,7 +115,7 @@ def tune(config, args):
if index % 2 == 0:
sys.stdout.write('.')
sys.stdout.flush()
-print(f"tuneing: one grid done!")
+print("tuneing: one grid done!")
# output on-line tuning result at the end of current batch
err_ave_min = min(err_ave)

@ -185,7 +166,7 @@ if __name__ == "__main__":
add_arg('cutoff_top_n', int, 40, "Cutoff number for pruning.")
args = parser.parse_args()
-print_arguments(args)
+print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -11,8 +11,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
from deepspeech.models.deepspeech2 import DeepSpeech2Model

_C = CN()

@ -21,7 +21,9 @@ _C.data = CN(
train_manifest="",
dev_manifest="",
test_manifest="",
+unit_type="char",
vocab_filepath="",
+spm_model_prefix="",
mean_std_filepath="",
augmentation_config="",
max_duration=float('inf'),

@ -30,8 +32,10 @@ _C.data = CN(
window_ms=20.0,  # ms
n_fft=None,  # fft points
max_freq=None,  # None for samplerate/2
-specgram_type='linear',  # 'linear', 'mfcc'
-target_sample_rate=16000,  # sample rate
+specgram_type='linear',  # 'linear', 'mfcc', 'fbank'
+feat_dim=0,  # 'mfcc', 'fbank'
+delat_delta=False,  # 'mfcc', 'fbank'
+target_sample_rate=16000,  # target sample rate
use_dB_normalization=True,
target_dB=-20,
random_seed=0,

@ -81,4 +85,6 @@ def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
-return _C.clone()
+config = _C.clone()
+config.set_new_allowed(True)
+return config

@ -12,46 +12,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains DeepSpeech2 model."""
-import io
-import sys
-import os
import time
-import logging
-import numpy as np
from collections import defaultdict
-from functools import partial
from pathlib import Path
+import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
-from deepspeech.training import Trainer
-from deepspeech.training.gradclip import MyClipGradByGlobalNorm
-from deepspeech.utils import mp_tools
-from deepspeech.utils import layer_tools
-from deepspeech.utils import error_rate
from deepspeech.io.collator import SpeechCollator
-from deepspeech.io.sampler import SortagradDistributedBatchSampler
-from deepspeech.io.sampler import SortagradBatchSampler
from deepspeech.io.dataset import ManifestDataset
+from deepspeech.io.sampler import SortagradBatchSampler
+from deepspeech.io.sampler import SortagradDistributedBatchSampler
-from deepspeech.models.deepspeech2 import DeepSpeech2Model
from deepspeech.models.deepspeech2 import DeepSpeech2InferModel
+from deepspeech.models.deepspeech2 import DeepSpeech2Model
+from deepspeech.training.gradclip import ClipGradByGlobalNormWithLog
+from deepspeech.training.trainer import Trainer
+from deepspeech.utils import error_rate
+from deepspeech.utils import layer_tools
+from deepspeech.utils import mp_tools
+from deepspeech.utils.log import Log

-logger = logging.getLogger(__name__)
+logger = Log(__name__).getlog()
class DeepSpeech2Trainer(Trainer):
def __init__(self, config, args):
super().__init__(config, args)

-def train_batch(self, batch_data):
+def train_batch(self, batch_index, batch_data, msg):
start = time.time()
-self.model.train()
loss = self.model(*batch_data)
loss.backward()
layer_tools.print_grads(self.model, print_func=None)

@ -63,46 +55,49 @@ class DeepSpeech2Trainer(Trainer):
losses_np = {
'train_loss': float(loss),
}
-msg = "Train: Rank: {}, ".format(dist.get_rank())
-msg += "epoch: {}, ".format(self.epoch)
-msg += "step: {}, ".format(self.iteration)
-msg += "time: {:>.3f}s, ".format(iteration_time)
+msg += "train time: {:>.3f}s, ".format(iteration_time)
+msg += "batch size: {}, ".format(self.config.data.batch_size)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
-self.logger.info(msg)
+logger.info(msg)

if dist.get_rank() == 0 and self.visualizer:
for k, v in losses_np.items():
self.visualizer.add_scalar("train/{}".format(k), v,
self.iteration)
-self.iteration += 1
-@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
-self.logger.info(
-f"Valid Total Examples: {len(self.valid_loader.dataset)}")
+logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}")
self.model.eval()
valid_losses = defaultdict(list)
+num_seen_utts = 1
+total_loss = 0.0
for i, batch in enumerate(self.valid_loader):
loss = self.model(*batch)
-valid_losses['val_loss'].append(float(loss))
-
-# write visual log
-valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
-
-# logging
-msg = f"Valid: Rank: {dist.get_rank()}, "
-msg += "epoch: {}, ".format(self.epoch)
-msg += "step: {}, ".format(self.iteration)
-msg += ', '.join('{}: {:>.6f}'.format(k, v)
-for k, v in valid_losses.items())
-self.logger.info(msg)
-
-if self.visualizer:
-for k, v in valid_losses.items():
-self.visualizer.add_scalar("valid/{}".format(k), v,
-self.iteration)
+if paddle.isfinite(loss):
+num_utts = batch[0].shape[0]
+num_seen_utts += num_utts
+total_loss += float(loss) * num_utts
+valid_losses['val_loss'].append(float(loss))
+if (i + 1) % self.config.training.log_interval == 0:
+valid_dump = {k: np.mean(v) for k, v in valid_losses.items()}
+valid_dump['val_history_loss'] = total_loss / num_seen_utts
+# logging
+msg = f"Valid: Rank: {dist.get_rank()}, "
+msg += "epoch: {}, ".format(self.epoch)
+msg += "step: {}, ".format(self.iteration)
+msg += "batch : {}/{}, ".format(i + 1, len(self.valid_loader))
+msg += ', '.join('{}: {:>.6f}'.format(k, v)
+for k, v in valid_dump.items())
+logger.info(msg)
+logger.info('Rank {} Val info val_loss {}'.format(
+dist.get_rank(), total_loss / num_seen_utts))
+return total_loss, num_seen_utts
def setup_model(self):
config = self.config

@ -118,9 +113,11 @@ class DeepSpeech2Trainer(Trainer):
if self.parallel:
model = paddle.DataParallel(model)

-layer_tools.print_params(model, self.logger.info)
+logger.info(f"{model}")
+layer_tools.print_params(model, logger.info)

-grad_clip = MyClipGradByGlobalNorm(config.training.global_grad_clip)
+grad_clip = ClipGradByGlobalNormWithLog(
+config.training.global_grad_clip)
lr_scheduler = paddle.optimizer.lr.ExponentialDecay(
learning_rate=config.training.lr,
gamma=config.training.lr_decay,

@ -135,48 +132,19 @@ class DeepSpeech2Trainer(Trainer):
self.model = model
self.optimizer = optimizer
self.lr_scheduler = lr_scheduler
-self.logger.info("Setup model/optimizer/lr_scheduler!")
+logger.info("Setup model/optimizer/lr_scheduler!")
def setup_dataloader(self):
-config = self.config
+config = self.config.clone()
+config.defrost()
+config.data.keep_transcription_text = False
-train_dataset = ManifestDataset(
-config.data.train_manifest,
-config.data.vocab_filepath,
-config.data.mean_std_filepath,
-augmentation_config=io.open(
-config.data.augmentation_config, mode='r',
-encoding='utf8').read(),
-max_duration=config.data.max_duration,
-min_duration=config.data.min_duration,
-stride_ms=config.data.stride_ms,
-window_ms=config.data.window_ms,
-n_fft=config.data.n_fft,
-max_freq=config.data.max_freq,
-target_sample_rate=config.data.target_sample_rate,
-specgram_type=config.data.specgram_type,
-use_dB_normalization=config.data.use_dB_normalization,
-target_dB=config.data.target_dB,
-random_seed=config.data.random_seed,
-keep_transcription_text=False)
+config.data.manifest = config.data.train_manifest
+train_dataset = ManifestDataset.from_config(config)
-dev_dataset = ManifestDataset(
-config.data.dev_manifest,
-config.data.vocab_filepath,
-config.data.mean_std_filepath,
-augmentation_config="{}",
-max_duration=config.data.max_duration,
-min_duration=config.data.min_duration,
-stride_ms=config.data.stride_ms,
-window_ms=config.data.window_ms,
-n_fft=config.data.n_fft,
-max_freq=config.data.max_freq,
-target_sample_rate=config.data.target_sample_rate,
-specgram_type=config.data.specgram_type,
-use_dB_normalization=config.data.use_dB_normalization,
-target_dB=config.data.target_dB,
-random_seed=config.data.random_seed,
-keep_transcription_text=False)
+config.data.manifest = config.data.dev_manifest
+config.data.augmentation_config = ""
+dev_dataset = ManifestDataset.from_config(config)

if self.parallel:
batch_sampler = SortagradDistributedBatchSampler(

@ -197,7 +165,7 @@ class DeepSpeech2Trainer(Trainer):
sortagrad=config.data.sortagrad,
shuffle_method=config.data.shuffle_method)

-collate_fn = SpeechCollator(is_training=True)
+collate_fn = SpeechCollator(keep_transcription_text=False)
self.train_loader = DataLoader(
train_dataset,
batch_sampler=batch_sampler,

@ -209,7 +177,7 @@ class DeepSpeech2Trainer(Trainer):
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
-self.logger.info("Setup train/valid Dataloader!")
+logger.info("Setup train/valid Dataloader!")
class DeepSpeech2Tester(DeepSpeech2Trainer):

@ -225,7 +193,7 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
trans.append(''.join([chr(i) for i in ids]))
return trans

-def compute_metrics(self, audio, texts, audio_len, texts_len):
+def compute_metrics(self, audio, audio_len, texts, texts_len):
cfg = self.config.decoding
errors_sum, len_refs, num_ins = 0.0, 0, 0
errors_func = error_rate.char_errors if cfg.error_rate_type == 'cer' else error_rate.word_errors

@ -252,11 +220,10 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
errors_sum += errors
len_refs += len_ref
num_ins += 1
-self.logger.info(
-"\nTarget Transcription: %s\nOutput Transcription: %s" %
-(target, result))
-self.logger.info("Current error rate [%s] = %f" % (
-cfg.error_rate_type, error_rate_func(target, result)))
+logger.info("\nTarget Transcription: %s\nOutput Transcription: %s" %
+(target, result))
+logger.info("Current error rate [%s] = %f" %
+(cfg.error_rate_type, error_rate_func(target, result)))

return dict(
errors_sum=errors_sum,

@ -268,8 +235,7 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
@mp_tools.rank_zero_only
@paddle.no_grad()
def test(self):
-self.logger.info(
-f"Test Total Examples: {len(self.test_loader.dataset)}")
+logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}")
self.model.eval()
cfg = self.config
error_rate_type = None

@ -281,19 +247,19 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
len_refs += metrics['len_refs']
num_ins += metrics['num_ins']
error_rate_type = metrics['error_rate_type']
-self.logger.info("Error rate [%s] (%d/?) = %f" %
+logger.info("Error rate [%s] (%d/?) = %f" %
(error_rate_type, num_ins, errors_sum / len_refs))

# logging
msg = "Test: "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
-msg += ", Final error rate [%s] (%d/%d) = %f" % (
+msg += "Final error rate [%s] (%d/%d) = %f" % (
error_rate_type, num_ins, num_ins, errors_sum / len_refs)
-self.logger.info(msg)
+logger.info(msg)

def run_test(self):
-self.resume_or_load()
+self.resume_or_scratch()
try:
self.test()
except KeyboardInterrupt:

@ -329,7 +295,6 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
self.setup_output_dir()
self.setup_checkpointer()
-self.setup_logger()
self.setup_dataloader()
self.setup_model()
@ -348,28 +313,25 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
use_gru=config.model.use_gru,
share_rnn_weights=config.model.share_rnn_weights)
self.model = model
-self.logger.info("Setup model!")
+logger.info("Setup model!")

def setup_dataloader(self):
-config = self.config
+config = self.config.clone()
+config.defrost()
# return raw text
-test_dataset = ManifestDataset(
-config.data.test_manifest,
-config.data.vocab_filepath,
-config.data.mean_std_filepath,
-augmentation_config="{}",
-max_duration=config.data.max_duration,
-min_duration=config.data.min_duration,
-stride_ms=config.data.stride_ms,
-window_ms=config.data.window_ms,
-n_fft=config.data.n_fft,
-max_freq=config.data.max_freq,
-target_sample_rate=config.data.target_sample_rate,
-specgram_type=config.data.specgram_type,
-use_dB_normalization=config.data.use_dB_normalization,
-target_dB=config.data.target_dB,
-random_seed=config.data.random_seed,
-keep_transcription_text=True)
+config.data.manifest = config.data.test_manifest
+config.data.keep_transcription_text = True
+config.data.augmentation_config = ""
+# filter test examples, will cause less examples, but no mismatch with training
+# and can use large batch size , save training time, so filter test egs now.
+# config.data.min_input_len = 0.0  # second
+# config.data.max_input_len = float('inf')  # second
+# config.data.min_output_len = 0.0  # tokens
+# config.data.max_output_len = float('inf')  # tokens
+# config.data.min_output_input_ratio = 0.00
+# config.data.max_output_input_ratio = float('inf')
+test_dataset = ManifestDataset.from_config(config)

# return text ord id
self.test_loader = DataLoader(

@ -377,8 +339,8 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
batch_size=config.decoding.batch_size,
shuffle=False,
drop_last=False,
-collate_fn=SpeechCollator(is_training=False))
+collate_fn=SpeechCollator(keep_transcription_text=True))
-self.logger.info("Setup test Dataloader!")
+logger.info("Setup test Dataloader!")
def setup_output_dir(self):
"""Create a directory used for output.

@ -393,25 +355,3 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
output_dir.mkdir(parents=True, exist_ok=True)
self.output_dir = output_dir

-def setup_logger(self):
-"""Initialize a text logger to log the experiment.
-
-Each process has its own text logger. The logging message is write to
-the standard output and a text file named ``worker_n.log`` in the
-output directory, where ``n`` means the rank of the process.
-"""
-format = '[%(levelname)s %(asctime)s %(filename)s:%(lineno)d] %(message)s'
-formatter = logging.Formatter(fmt=format, datefmt='%Y/%m/%d %H:%M:%S')
-logger.setLevel("INFO")
-
-# global logger
-stdout = True
-save_path = ""
-logging.basicConfig(
-level=logging.DEBUG if stdout else logging.INFO,
-format=format,
-datefmt='%Y/%m/%d %H:%M:%S',
-filename=save_path if not stdout else None)
-self.logger = logger

@ -0,0 +1,13 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,48 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Export for U2 model."""
from deepspeech.exps.u2.config import get_cfg_defaults
from deepspeech.exps.u2.model import U2Tester as Tester
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):
exp = Tester(config, args)
exp.setup()
exp.run_export()
def main(config, args):
main_sp(config, args)
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
if args.dump_config:
with open(args.dump_config, 'w') as f:
print(config, file=f)
main(config, args)

@ -11,22 +11,15 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-"""Inferer for DeepSpeech2 model."""
+"""Evaluation for U2 model."""
+import cProfile
-import io
-import logging
-import argparse
-import functools
-from paddle import distributed as dist

+from deepspeech.exps.u2.config import get_cfg_defaults
+from deepspeech.exps.u2.model import U2Tester as Tester
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
-from deepspeech.utils.error_rate import char_errors, word_errors

# TODO(hui zhang): dynamic load
-from deepspeech.exps.deepspeech2.config import get_cfg_defaults
-from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester

def main_sp(config, args):

@ -42,7 +35,7 @@ def main(config, args):

if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
-print_arguments(args)
+print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()

@ -56,4 +49,7 @@ if __name__ == "__main__":
with open(args.dump_config, 'w') as f:
print(config, file=f)
-main(config, args)
+# Setting for profiling
+pr = cProfile.Profile()
+pr.runcall(main, config, args)
+pr.dump_stats('test.profile')

@ -0,0 +1,59 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Trainer for U2 model."""
import cProfile
import os
from paddle import distributed as dist
from deepspeech.exps.u2.config import get_cfg_defaults
from deepspeech.exps.u2.model import U2Trainer as Trainer
from deepspeech.training.cli import default_argument_parser
from deepspeech.utils.utility import print_arguments
def main_sp(config, args):
exp = Trainer(config, args)
exp.setup()
exp.run()
def main(config, args):
if args.device == "gpu" and args.nprocs > 1:
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
else:
main_sp(config, args)
if __name__ == "__main__":
parser = default_argument_parser()
args = parser.parse_args()
print_arguments(args, globals())
# https://yaml.org/type/float.html
config = get_cfg_defaults()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
if args.dump_config:
with open(args.dump_config, 'w') as f:
print(config, file=f)
# Setting for profiling
pr = cProfile.Profile()
pr.runcall(main, config, args)
pr.dump_stats(os.path.join(args.output, 'train.profile'))

@ -0,0 +1,38 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode
from deepspeech.exps.u2.model import U2Tester
from deepspeech.exps.u2.model import U2Trainer
from deepspeech.io.dataset import ManifestDataset
from deepspeech.models.u2 import U2Model
_C = CfgNode()
_C.data = ManifestDataset.params()
_C.model = U2Model.params()
_C.training = U2Trainer.params()
_C.decoding = U2Tester.params()
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
config = _C.clone()
config.set_new_allowed(True)
return config
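A hedged sketch of the params() composition pattern used here: each component exposes its defaults as a CfgNode classmethod, and the experiment config is assembled from those pieces. The component below is invented purely to illustrate the shape.

from typing import Optional
from yacs.config import CfgNode

class MyComponent():
    @classmethod
    def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
        default = CfgNode(dict(hidden_size=256, dropout=0.1))
        if config is not None:
            config.merge_from_other_cfg(default)
        return default

_cfg = CfgNode()
_cfg.my_component = MyComponent.params()  # defaults become one section of the experiment config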

@ -0,0 +1,545 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains U2 model."""
import json
import os
import sys
import time
from collections import defaultdict
from pathlib import Path
from typing import Optional
import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
from yacs.config import CfgNode
from deepspeech.io.collator import SpeechCollator
from deepspeech.io.dataset import ManifestDataset
from deepspeech.io.sampler import SortagradBatchSampler
from deepspeech.io.sampler import SortagradDistributedBatchSampler
from deepspeech.models.u2 import U2Model
from deepspeech.training.gradclip import ClipGradByGlobalNormWithLog
from deepspeech.training.scheduler import WarmupLR
from deepspeech.training.trainer import Trainer
from deepspeech.utils import error_rate
from deepspeech.utils import layer_tools
from deepspeech.utils import mp_tools
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
class U2Trainer(Trainer):
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
# training config
default = CfgNode(
dict(
n_epoch=50, # train epochs
log_interval=100, # steps
accum_grad=1, # accum grad by # steps
global_grad_clip=5.0, # the global norm clip
))
default.optim = 'adam'
default.optim_conf = CfgNode(
dict(
lr=5e-4, # learning rate
weight_decay=1e-6, # the coeff of weight decay
))
default.scheduler = 'warmuplr'
default.scheduler_conf = CfgNode(
dict(
warmup_steps=25000,
lr_decay=1.0, # learning rate decay
))
if config is not None:
config.merge_from_other_cfg(default)
return default
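For orientation, the 'warmuplr' scheduler configured above is typically the Noam-style warmup used by Transformer recipes; the exact implementation lives in deepspeech.training.scheduler and may differ, so treat this as a sketch only.

def warmup_lr(step, base_lr=5e-4, warmup_steps=25000):
    """Linear ramp for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)
    return base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)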
def __init__(self, config, args):
super().__init__(config, args)
def train_batch(self, batch_index, batch_data, msg):
train_conf = self.config.training
start = time.time()
loss, attention_loss, ctc_loss = self.model(*batch_data)
# loss div by `batch_size * accum_grad`
loss /= train_conf.accum_grad
loss.backward()
layer_tools.print_grads(self.model, print_func=None)
losses_np = {'loss': float(loss) * train_conf.accum_grad}
if attention_loss:
losses_np['att_loss'] = float(attention_loss)
if ctc_loss:
losses_np['ctc_loss'] = float(ctc_loss)
if (batch_index + 1) % train_conf.accum_grad == 0:
self.optimizer.step()
self.optimizer.clear_grad()
self.lr_scheduler.step()
self.iteration += 1
iteration_time = time.time() - start
if (batch_index + 1) % train_conf.log_interval == 0:
msg += "train time: {:>.3f}s, ".format(iteration_time)
msg += "batch size: {}, ".format(self.config.data.batch_size)
msg += "accum: {}, ".format(train_conf.accum_grad)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
logger.info(msg)
if dist.get_rank() == 0 and self.visualizer:
losses_np_v = losses_np.copy()
losses_np_v.update({"lr": self.lr_scheduler()})
self.visualizer.add_scalars("step", losses_np_v,
self.iteration - 1)
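A simplified sketch of the gradient-accumulation pattern implemented above (names shortened, illustrative only): gradients are summed over accum_grad micro-batches before a single optimizer step, so the effective batch size is data.batch_size * accum_grad.

# for batch_index, batch in enumerate(loader):
#     loss = model(*batch) / accum_grad      # scale so the summed gradient matches one big batch
#     loss.backward()                        # gradients accumulate across micro-batches
#     if (batch_index + 1) % accum_grad == 0:
#         optimizer.step()                   # one update per accum_grad micro-batches
#         optimizer.clear_grad()
#         lr_scheduler.step()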
@paddle.no_grad()
def valid(self):
self.model.eval()
logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}")
valid_losses = defaultdict(list)
num_seen_utts = 1
total_loss = 0.0
for i, batch in enumerate(self.valid_loader):
loss, attention_loss, ctc_loss = self.model(*batch)
if paddle.isfinite(loss):
num_utts = batch[0].shape[0]
num_seen_utts += num_utts
total_loss += float(loss) * num_utts
valid_losses['val_loss'].append(float(loss))
if attention_loss:
valid_losses['val_att_loss'].append(float(attention_loss))
if ctc_loss:
valid_losses['val_ctc_loss'].append(float(ctc_loss))
if (i + 1) % self.config.training.log_interval == 0:
valid_dump = {k: np.mean(v) for k, v in valid_losses.items()}
valid_dump['val_history_loss'] = total_loss / num_seen_utts
# logging
msg = f"Valid: Rank: {dist.get_rank()}, "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader))
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_dump.items())
logger.info(msg)
logger.info('Rank {} Val info val_loss {}'.format(
dist.get_rank(), total_loss / num_seen_utts))
return total_loss, num_seen_utts
def train(self):
"""The training process control by step."""
# !!!IMPORTANT!!!
# Try to export the model by script, if fails, we should refine
# the code to satisfy the script export requirements
# script_model = paddle.jit.to_static(self.model)
# script_model_path = str(self.checkpoint_dir / 'init')
# paddle.jit.save(script_model, script_model_path)
from_scratch = self.resume_or_scratch()
if from_scratch:
# save init model, i.e. 0 epoch
self.save(tag='init')
self.lr_scheduler.step(self.iteration)
if self.parallel:
self.train_loader.batch_sampler.set_epoch(self.epoch)
logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}")
while self.epoch < self.config.training.n_epoch:
self.model.train()
try:
data_start_time = time.time()
for batch_index, batch in enumerate(self.train_loader):
dataload_time = time.time() - data_start_time
msg = "Train: Rank: {}, ".format(dist.get_rank())
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "batch : {}/{}, ".format(batch_index + 1,
len(self.train_loader))
msg += "lr: {:>.8f}, ".format(self.lr_scheduler())
msg += "data time: {:>.3f}s, ".format(dataload_time)
self.train_batch(batch_index, batch, msg)
data_start_time = time.time()
except Exception as e:
logger.error(e)
raise e
total_loss, num_seen_utts = self.valid()
if dist.get_world_size() > 1:
num_seen_utts = paddle.to_tensor(num_seen_utts)
# the default operator in all_reduce function is sum.
dist.all_reduce(num_seen_utts)
total_loss = paddle.to_tensor(total_loss)
dist.all_reduce(total_loss)
cv_loss = total_loss / num_seen_utts
cv_loss = float(cv_loss)
else:
cv_loss = total_loss / num_seen_utts
logger.info(
'Epoch {} Val info val_loss {}'.format(self.epoch, cv_loss))
if self.visualizer:
self.visualizer.add_scalars(
'epoch', {'cv_loss': cv_loss,
'lr': self.lr_scheduler()}, self.epoch)
self.save(tag=self.epoch, infos={'val_loss': cv_loss})
self.new_epoch()
def setup_dataloader(self):
config = self.config.clone()
config.defrost()
config.data.keep_transcription_text = False
# train/valid dataset, return token ids
config.data.manifest = config.data.train_manifest
train_dataset = ManifestDataset.from_config(config)
config.data.manifest = config.data.dev_manifest
config.data.augmentation_config = ""
dev_dataset = ManifestDataset.from_config(config)
collate_fn = SpeechCollator(keep_transcription_text=False)
if self.parallel:
batch_sampler = SortagradDistributedBatchSampler(
train_dataset,
batch_size=config.data.batch_size,
num_replicas=None,
rank=None,
shuffle=True,
drop_last=True,
sortagrad=config.data.sortagrad,
shuffle_method=config.data.shuffle_method)
else:
batch_sampler = SortagradBatchSampler(
train_dataset,
shuffle=True,
batch_size=config.data.batch_size,
drop_last=True,
sortagrad=config.data.sortagrad,
shuffle_method=config.data.shuffle_method)
self.train_loader = DataLoader(
train_dataset,
batch_sampler=batch_sampler,
collate_fn=collate_fn,
num_workers=config.data.num_workers, )
self.valid_loader = DataLoader(
dev_dataset,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
# test dataset, return raw text
config.data.manifest = config.data.test_manifest
config.data.keep_transcription_text = True
config.data.augmentation_config = ""
# filter test examples, will cause less examples, but no mismatch with training
# and can use large batch size , save training time, so filter test egs now.
# config.data.min_input_len = 0.0 # second
# config.data.max_input_len = float('inf') # second
# config.data.min_output_len = 0.0 # tokens
# config.data.max_output_len = float('inf') # tokens
# config.data.min_output_input_ratio = 0.00
# config.data.max_output_input_ratio = float('inf')
test_dataset = ManifestDataset.from_config(config)
# return text ord id
self.test_loader = DataLoader(
test_dataset,
batch_size=config.decoding.batch_size,
shuffle=False,
drop_last=False,
collate_fn=SpeechCollator(keep_transcription_text=True))
logger.info("Setup train/valid/test Dataloader!")
def setup_model(self):
config = self.config
model_conf = config.model
model_conf.defrost()
model_conf.input_dim = self.train_loader.dataset.feature_size
model_conf.output_dim = self.train_loader.dataset.vocab_size
model_conf.freeze()
model = U2Model.from_config(model_conf)
if self.parallel:
model = paddle.DataParallel(model)
logger.info(f"{model}")
layer_tools.print_params(model, logger.info)
train_config = config.training
optim_type = train_config.optim
optim_conf = train_config.optim_conf
scheduler_type = train_config.scheduler
scheduler_conf = train_config.scheduler_conf
grad_clip = ClipGradByGlobalNormWithLog(train_config.global_grad_clip)
weight_decay = paddle.regularizer.L2Decay(optim_conf.weight_decay)
if scheduler_type == 'expdecaylr':
lr_scheduler = paddle.optimizer.lr.ExponentialDecay(
learning_rate=optim_conf.lr,
gamma=scheduler_conf.lr_decay,
verbose=False)
elif scheduler_type == 'warmuplr':
lr_scheduler = WarmupLR(
learning_rate=optim_conf.lr,
warmup_steps=scheduler_conf.warmup_steps,
verbose=False)
else:
raise ValueError(f"Not support scheduler: {scheduler_type}")
if optim_type == 'adam':
optimizer = paddle.optimizer.Adam(
learning_rate=lr_scheduler,
parameters=model.parameters(),
weight_decay=weight_decay,
grad_clip=grad_clip)
else:
raise ValueError(f"Not support optim: {optim_type}")
self.model = model
self.optimizer = optimizer
self.lr_scheduler = lr_scheduler
logger.info("Setup model/optimizer/lr_scheduler!")
class U2Tester(U2Trainer):
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
# decoding config
default = CfgNode(
dict(
alpha=2.5, # Coef of LM for beam search.
beta=0.3, # Coef of WC for beam search.
cutoff_prob=1.0, # Cutoff probability for pruning.
cutoff_top_n=40, # Cutoff number for pruning.
lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm', # Filepath for language model.
decoding_method='attention', # Decoding method. Options: 'attention', 'ctc_greedy_search',
# 'ctc_prefix_beam_search', 'attention_rescoring'
error_rate_type='wer', # Error rate type for evaluation. Options `wer`, 'cer'
num_proc_bsearch=8, # # of CPUs for beam search.
beam_size=10, # Beam search width.
batch_size=16, # decoding batch size
ctc_weight=0.0, # ctc weight for attention rescoring decode mode.
decoding_chunk_size=-1, # decoding chunk size. Defaults to -1.
# <0: for decoding, use full chunk.
# >0: for decoding, use fixed chunk size as set.
# 0: used for training, it's prohibited here.
num_decoding_left_chunks=-1, # number of left chunks for decoding. Defaults to -1.
simulate_streaming=False, # simulate streaming inference. Defaults to False.
))
if config is not None:
config.merge_from_other_cfg(default)
return default
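A hedged sketch of how ctc_weight is commonly used by the 'attention_rescoring' mode named above: CTC prefix beam search proposes n-best hypotheses and the attention decoder rescores them, with the final score a weighted sum. The authoritative logic is inside U2Model.decode; this is illustrative only.

def rescore(att_scores, ctc_scores, ctc_weight=0.0):
    """Return the index of the hypothesis with the best fused score."""
    fused = [att + ctc_weight * ctc for att, ctc in zip(att_scores, ctc_scores)]
    return max(range(len(fused)), key=lambda i: fused[i])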
def __init__(self, config, args):
super().__init__(config, args)
def ordid2token(self, texts, texts_len):
""" ord() id to chr() chr """
trans = []
for text, n in zip(texts, texts_len):
n = n.numpy().item()
ids = text[:n]
trans.append(''.join([chr(i) for i in ids]))
return trans
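The ord()/chr() round trip relied on above keeps transcripts inside padded integer tensors; a tiny self-contained example:

ids = [ord(c) for c in "hello"]          # characters stored as ord() ids
text = ''.join(chr(i) for i in ids)      # decoded back for scoring
assert text == "hello"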
def compute_metrics(self, audio, audio_len, texts, texts_len, fout=None):
cfg = self.config.decoding
errors_sum, len_refs, num_ins = 0.0, 0, 0
errors_func = error_rate.char_errors if cfg.error_rate_type == 'cer' else error_rate.word_errors
error_rate_func = error_rate.cer if cfg.error_rate_type == 'cer' else error_rate.wer
start_time = time.time()
text_feature = self.test_loader.dataset.text_feature
target_transcripts = self.ordid2token(texts, texts_len)
result_transcripts = self.model.decode(
audio,
audio_len,
text_feature=text_feature,
decoding_method=cfg.decoding_method,
lang_model_path=cfg.lang_model_path,
beam_alpha=cfg.alpha,
beam_beta=cfg.beta,
beam_size=cfg.beam_size,
cutoff_prob=cfg.cutoff_prob,
cutoff_top_n=cfg.cutoff_top_n,
num_processes=cfg.num_proc_bsearch,
ctc_weight=cfg.ctc_weight,
decoding_chunk_size=cfg.decoding_chunk_size,
num_decoding_left_chunks=cfg.num_decoding_left_chunks,
simulate_streaming=cfg.simulate_streaming)
decode_time = time.time() - start_time
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
errors_sum += errors
len_refs += len_ref
num_ins += 1
if fout:
fout.write(result + "\n")
logger.info("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
logger.info("One example error rate [%s] = %f" %
(cfg.error_rate_type, error_rate_func(target, result)))
return dict(
errors_sum=errors_sum,
len_refs=len_refs,
num_ins=num_ins, # num examples
error_rate=errors_sum / len_refs,
error_rate_type=cfg.error_rate_type,
num_frames=audio_len.sum().numpy().item(),
decode_time=decode_time)
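For reference, the error rate accumulated above is edit distance over reference length; a hedged usage sketch of the error_rate helpers (return value inferred from their use in this diff):

# errors, len_ref = error_rate.char_errors(target, result)   # (edit distance, reference length)
# cer = errors / len_ref                                      # CER = char edits / reference chars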
@mp_tools.rank_zero_only
@paddle.no_grad()
def test(self):
assert self.args.result_file
self.model.eval()
logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}")
stride_ms = self.test_loader.dataset.stride_ms
error_rate_type = None
errors_sum, len_refs, num_ins = 0.0, 0, 0
num_frames = 0.0
num_time = 0.0
with open(self.args.result_file, 'w') as fout:
for i, batch in enumerate(self.test_loader):
metrics = self.compute_metrics(*batch, fout=fout)
num_frames += metrics['num_frames']
num_time += metrics["decode_time"]
errors_sum += metrics['errors_sum']
len_refs += metrics['len_refs']
num_ins += metrics['num_ins']
error_rate_type = metrics['error_rate_type']
rtf = num_time / (num_frames * stride_ms)
logger.info(
"RTF: %f, Error rate [%s] (%d/?) = %f" %
(rtf, error_rate_type, num_ins, errors_sum / len_refs))
rtf = num_time / (num_frames * stride_ms)
msg = "Test: "
msg += "epoch: {}, ".format(self.epoch)
msg += "step: {}, ".format(self.iteration)
msg += "RTF: {}, ".format(rtf)
msg += "Final error rate [%s] (%d/%d) = %f" % (
error_rate_type, num_ins, num_ins, errors_sum / len_refs)
logger.info(msg)
# test meta results
err_meta_path = os.path.splitext(self.args.checkpoint_path)[0] + '.err'
err_type_str = "{}".format(error_rate_type)
with open(err_meta_path, 'w') as f:
data = json.dumps({
"epoch":
self.epoch,
"step":
self.iteration,
"rtf":
rtf,
error_rate_type:
errors_sum / len_refs,
"dataset_hour": (num_frames * stride_ms) / 1000.0 / 3600.0,
"process_hour":
num_time / 1000.0 / 3600.0,
"num_examples":
num_ins,
"err_sum":
errors_sum,
"ref_len":
len_refs,
})
f.write(data + '\n')
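For reference, the real-time factor reported above is processing time divided by the duration of audio decoded (duration being num_frames times the frame shift); an RTF below 1.0 means the recognizer runs faster than real time.

# RTF = total_decode_time / (num_frames * frame_shift)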
def run_test(self):
self.resume_or_scratch()
try:
self.test()
except KeyboardInterrupt:
sys.exit(-1)
def load_inferspec(self):
"""infer model and input spec.
Returns:
nn.Layer: inference model
List[paddle.static.InputSpec]: input spec.
"""
from deepspeech.models.u2 import U2InferModel
infer_model = U2InferModel.from_pretrained(self.test_loader.dataset,
self.config.model.clone(),
self.args.checkpoint_path)
feat_dim = self.test_loader.dataset.feature_size
input_spec = [
paddle.static.InputSpec(
shape=[None, feat_dim, None],
dtype='float32'), # audio, [B,D,T]
paddle.static.InputSpec(shape=[None],
dtype='int64'), # audio_length, [B]
]
return infer_model, input_spec
def export(self):
infer_model, input_spec = self.load_inferspec()
assert isinstance(input_spec, list), type(input_spec)
infer_model.eval()
static_model = paddle.jit.to_static(infer_model, input_spec=input_spec)
logger.info(f"Export code: {static_model.forward.code}")
paddle.jit.save(static_model, self.args.export_path)
def run_export(self):
try:
self.export()
except KeyboardInterrupt:
sys.exit(-1)
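A hedged usage sketch for the static graph exported above (the path is whatever --export_path was given; values illustrative):

# import paddle
# model = paddle.jit.load('exp/conformer/checkpoints/avg_20.jit')   # export_path from run_export
# eouts = model(audio, audio_len)   # matches the InputSpec: [B, D, T] float32 and [B] int64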
def setup(self):
"""Setup the experiment.
"""
paddle.set_device(self.args.device)
self.setup_output_dir()
self.setup_checkpointer()
self.setup_dataloader()
self.setup_model()
self.iteration = 0
self.epoch = 0
def setup_output_dir(self):
"""Create a directory used for output.
"""
# output dir
if self.args.output:
output_dir = Path(self.args.output).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
else:
output_dir = Path(
self.args.checkpoint_path).expanduser().parent.parent
output_dir.mkdir(parents=True, exist_ok=True)
self.output_dir = output_dir

@ -12,17 +12,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the audio segment class."""
-import copy
-import numpy as np
import io
-import struct
+import random
import re
-import soundfile
+import struct
+import numpy as np
import resampy
+import soundfile
from scipy import signal
-import random
-import copy
-import io

class AudioSegment(object):

@ -299,6 +298,18 @@ class AudioSegment(object):
samples = self._convert_samples_from_float32(self._samples, dtype)
return samples.tostring()

+def to(self, dtype='int16'):
+"""Create a `dtype` audio content.
+
+:param dtype: Data type for export samples. Options: 'int16', 'int32',
+'float32', 'float64'. Default is 'float32'.
+:type dtype: str
+:return: np.ndarray containing `dtype` audio content.
+:rtype: str
+"""
+samples = self._convert_samples_from_float32(self._samples, dtype)
+return samples
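An illustrative use of the new to() helper next to the existing to_bytes(); the file path and constructor call are assumptions, shown commented out:

# seg = AudioSegment.from_file("demo.wav")      # illustrative input
# pcm16 = seg.to(dtype='int16')                 # np.ndarray of int16 samples
# raw = seg.to_bytes(dtype='int16')             # same samples packed as bytes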
def gain_db(self, gain):
"""Apply gain in decibels to samples.

@ -322,14 +333,25 @@ class AudioSegment(object):
:type speed_rate: float
:raises ValueError: If speed_rate <= 0.0.
"""
+if speed_rate == 1.0:
+return
if speed_rate <= 0:
raise ValueError("speed_rate should be greater than zero.")
+# numpy
old_length = self._samples.shape[0]
new_length = int(old_length / speed_rate)
old_indices = np.arange(old_length)
new_indices = np.linspace(start=0, stop=old_length, num=new_length)
self._samples = np.interp(new_indices, old_indices, self._samples)
+# sox, slow
+# tfm = sox.Transformer()
+# tfm.set_globals(multithread=False)
+# tfm.speed(speed_rate)
+# self._samples = tfm.build_array(
+#     input_array=self._samples, sample_rate_in=self._sample_rate).copy()
def normalize(self, target_db=-20, max_gain_db=300.0):
"""Normalize audio to be of the desired RMS value in decibels.

@ -12,17 +12,19 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the data augmentation pipeline.""" """Contains the data augmentation pipeline."""
import json import json
import random
from deepspeech.frontend.augmentor.volume_perturb import VolumePerturbAugmentor import numpy as np
from deepspeech.frontend.augmentor.shift_perturb import ShiftPerturbAugmentor
from deepspeech.frontend.augmentor.speed_perturb import SpeedPerturbAugmentor
from deepspeech.frontend.augmentor.noise_perturb import NoisePerturbAugmentor
from deepspeech.frontend.augmentor.impulse_response import ImpulseResponseAugmentor from deepspeech.frontend.augmentor.impulse_response import ImpulseResponseAugmentor
from deepspeech.frontend.augmentor.resample import ResampleAugmentor from deepspeech.frontend.augmentor.noise_perturb import NoisePerturbAugmentor
from deepspeech.frontend.augmentor.online_bayesian_normalization import \ from deepspeech.frontend.augmentor.online_bayesian_normalization import \
OnlineBayesianNormalizationAugmentor OnlineBayesianNormalizationAugmentor
from deepspeech.frontend.augmentor.resample import ResampleAugmentor
from deepspeech.frontend.augmentor.shift_perturb import ShiftPerturbAugmentor
from deepspeech.frontend.augmentor.spec_augment import SpecAugmentor
from deepspeech.frontend.augmentor.speed_perturb import SpeedPerturbAugmentor
from deepspeech.frontend.augmentor.volume_perturb import VolumePerturbAugmentor
class AugmentationPipeline(): class AugmentationPipeline():
@ -83,10 +85,13 @@ class AugmentationPipeline():
:raises ValueError: If the augmentation json config is in an incorrect format. :raises ValueError: If the augmentation json config is in an incorrect format.
""" """
def __init__(self, augmentation_config, random_seed=0): def __init__(self, augmentation_config: str, random_seed=0):
self._rng = random.Random(random_seed) self._rng = np.random.RandomState(random_seed)
self._spec_types = ('specaug', )
self._augmentors, self._rates = self._parse_pipeline_from( self._augmentors, self._rates = self._parse_pipeline_from(
augmentation_config) augmentation_config, 'audio')
self._spec_augmentors, self._spec_rates = self._parse_pipeline_from(
augmentation_config, 'feature')
def transform_audio(self, audio_segment): def transform_audio(self, audio_segment):
"""Run the pre-processing pipeline for data augmentation. """Run the pre-processing pipeline for data augmentation.
@ -100,15 +105,41 @@ class AugmentationPipeline():
if self._rng.uniform(0., 1.) < rate: if self._rng.uniform(0., 1.) < rate:
augmentor.transform_audio(audio_segment) augmentor.transform_audio(audio_segment)
def _parse_pipeline_from(self, config_json): def transform_feature(self, spec_segment):
"""spectrogram augmentation.
Args:
spec_segment (np.ndarray): audio feature, (D, T).
"""
for augmentor, rate in zip(self._spec_augmentors, self._spec_rates):
if self._rng.uniform(0., 1.) < rate:
spec_segment = augmentor.transform_feature(spec_segment)
return spec_segment
def _parse_pipeline_from(self, config_json, aug_type='audio'):
"""Parse the config json to build a augmentation pipelien.""" """Parse the config json to build a augmentation pipelien."""
assert aug_type in ('audio', 'feature'), aug_type
try: try:
configs = json.loads(config_json) configs = json.loads(config_json)
audio_confs = []
feature_confs = []
for config in configs:
if config["type"] in self._spec_types:
feature_confs.append(config)
else:
audio_confs.append(config)
if aug_type == 'audio':
aug_confs = audio_confs
elif aug_type == 'feature':
aug_confs = feature_confs
augmentors = [ augmentors = [
self._get_augmentor(config["type"], config["params"]) self._get_augmentor(config["type"], config["params"])
for config in configs for config in aug_confs
] ]
rates = [config["prob"] for config in configs] rates = [config["prob"] for config in aug_confs]
except Exception as e: except Exception as e:
raise ValueError("Failed to parse the augmentation config json: " raise ValueError("Failed to parse the augmentation config json: "
"%s" % str(e)) "%s" % str(e))
@ -130,5 +161,7 @@ class AugmentationPipeline():
return NoisePerturbAugmentor(self._rng, **params) return NoisePerturbAugmentor(self._rng, **params)
elif augmentor_type == "impulse": elif augmentor_type == "impulse":
return ImpulseResponseAugmentor(self._rng, **params) return ImpulseResponseAugmentor(self._rng, **params)
elif augmentor_type == "specaug":
return SpecAugmentor(self._rng, **params)
else: else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type) raise ValueError("Unknown augmentor type [%s]." % augmentor_type)

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the abstract base class for augmentation models.""" """Contains the abstract base class for augmentation models."""
from abc import ABCMeta
from abc import ABCMeta, abstractmethod from abc import abstractmethod
class AugmentorBase(): class AugmentorBase():
@ -40,4 +40,16 @@ class AugmentorBase():
:param audio_segment: Audio segment to add effects to. :param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegment|SpeechSegment :type audio_segment: AudioSegment|SpeechSegment
""" """
pass raise NotImplementedError
@abstractmethod
def transform_feature(self, spec_segment):
"""Adds various effects to the input audo feature segment. Such effects
will augment the training data to make the model invariant to certain
types of time_mask or freq_mask in the real world, improving model's
generalization ability.
Args:
spec_segment (Spectrogram): Spectrogram segment to add effects to.
"""
raise NotImplementedError

@ -12,10 +12,9 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the impulse response augmentation model.""" """Contains the impulse response augmentation model."""
from deepspeech.frontend.audio import AudioSegment
from deepspeech.frontend.augmentor.base import AugmentorBase from deepspeech.frontend.augmentor.base import AugmentorBase
from deepspeech.frontend.utility import read_manifest from deepspeech.frontend.utility import read_manifest
from deepspeech.frontend.audio import AudioSegment
class ImpulseResponseAugmentor(AugmentorBase): class ImpulseResponseAugmentor(AugmentorBase):
@ -39,6 +38,7 @@ class ImpulseResponseAugmentor(AugmentorBase):
:param audio_segment: Audio segment to add effects to. :param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment :type audio_segment: AudioSegmenet|SpeechSegment
""" """
impulse_json = self._rng.sample(self._impulse_manifest, 1)[0] impulse_json = self._rng.choice(
self._impulse_manifest, 1, replace=False)[0]
impulse_segment = AudioSegment.from_file(impulse_json['audio_filepath']) impulse_segment = AudioSegment.from_file(impulse_json['audio_filepath'])
audio_segment.convolve(impulse_segment, allow_resample=True) audio_segment.convolve(impulse_segment, allow_resample=True)

@ -12,10 +12,9 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the noise perturb augmentation model.""" """Contains the noise perturb augmentation model."""
from deepspeech.frontend.audio import AudioSegment
from deepspeech.frontend.augmentor.base import AugmentorBase from deepspeech.frontend.augmentor.base import AugmentorBase
from deepspeech.frontend.utility import read_manifest from deepspeech.frontend.utility import read_manifest
from deepspeech.frontend.audio import AudioSegment
class NoisePerturbAugmentor(AugmentorBase): class NoisePerturbAugmentor(AugmentorBase):
@ -45,7 +44,7 @@ class NoisePerturbAugmentor(AugmentorBase):
:param audio_segment: Audio segment to add effects to. :param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegment|SpeechSegment :type audio_segment: AudioSegment|SpeechSegment
""" """
noise_json = self._rng.sample(self._noise_manifest, 1)[0] noise_json = self._rng.choice(self._noise_manifest, 1, replace=False)[0]
if noise_json['duration'] < audio_segment.duration: if noise_json['duration'] < audio_segment.duration:
raise RuntimeError("The duration of sampled noise audio is smaller " raise RuntimeError("The duration of sampled noise audio is smaller "
"than the audio segment to add effects to.") "than the audio segment to add effects to.")

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contain the online bayesian normalization augmentation model.""" """Contain the online bayesian normalization augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase from deepspeech.frontend.augmentor.base import AugmentorBase

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contain the resample augmentation model.""" """Contain the resample augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase from deepspeech.frontend.augmentor.base import AugmentorBase

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the volume perturb augmentation model.""" """Contains the volume perturb augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase from deepspeech.frontend.augmentor.base import AugmentorBase

@ -0,0 +1,170 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains the volume perturb augmentation model."""
import numpy as np
from deepspeech.frontend.augmentor.base import AugmentorBase
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
class SpecAugmentor(AugmentorBase):
"""Augmentation model for Time warping, Frequency masking, Time masking.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
https://arxiv.org/abs/1904.08779
SpecAugment on Large Scale Datasets
https://arxiv.org/abs/1912.05533
"""
def __init__(self,
rng,
F,
T,
n_freq_masks,
n_time_masks,
p=1.0,
W=40,
adaptive_number_ratio=0,
adaptive_size_ratio=0,
max_n_time_masks=20):
"""SpecAugment class.
Args:
rng (random.Random): random generator object.
F (int): parameter for frequency masking
T (int): parameter for time masking
n_freq_masks (int): number of frequency masks
n_time_masks (int): number of time masks
p (float): parameter for the upper bound of the time mask
W (int): parameter for time warping
adaptive_number_ratio (float): adaptive multiplicity ratio for time masking
adaptive_size_ratio (float): adaptive size ratio for time masking
max_n_time_masks (int): maximum number of time masks
"""
super().__init__()
self._rng = rng
self.W = W
self.F = F
self.T = T
self.n_freq_masks = n_freq_masks
self.n_time_masks = n_time_masks
self.p = p
#logger.info(f"specaug: F-{F}, T-{T}, F-n-{n_freq_masks}, T-n-{n_time_masks}")
# adaptive SpecAugment
self.adaptive_number_ratio = adaptive_number_ratio
self.adaptive_size_ratio = adaptive_size_ratio
self.max_n_time_masks = max_n_time_masks
if adaptive_number_ratio > 0:
self.n_time_masks = 0
logger.info('n_time_masks is set to zero for adaptive SpecAugment.')
if adaptive_size_ratio > 0:
self.T = 0
logger.info('T is set to zero for adaptive SpecAugment.')
self._freq_mask = None
self._time_mask = None
def librispeech_basic(self):
self.W = 80
self.F = 27
self.T = 100
self.n_freq_masks = 1
self.n_time_masks = 1
self.p = 1.0
def librispeech_double(self):
self.W = 80
self.F = 27
self.T = 100
self.n_freq_masks = 2
self.n_time_masks = 2
self.p = 1.0
def switchboard_mild(self):
self.W = 40
self.F = 15
self.T = 70
self.n_freq_masks = 2
self.n_time_masks = 2
self.p = 0.2
def switchboard_strong(self):
self.W = 40
self.F = 27
self.T = 70
self.n_freq_masks = 2
self.n_time_masks = 2
self.p = 0.2
@property
def freq_mask(self):
return self._freq_mask
@property
def time_mask(self):
return self._time_mask
def time_warp(self, xs, W=40):
raise NotImplementedError
def mask_freq(self, xs, replace_with_zero=False):
n_bins = xs.shape[0]
for i in range(0, self.n_freq_masks):
f = int(self._rng.uniform(low=0, high=self.F))
f_0 = int(self._rng.uniform(low=0, high=n_bins - f))
xs[f_0:f_0 + f, :] = 0
assert f_0 <= f_0 + f
self._freq_mask = (f_0, f_0 + f)
return xs
def mask_time(self, xs, replace_with_zero=False):
n_frames = xs.shape[1]
if self.adaptive_number_ratio > 0:
n_masks = int(n_frames * self.adaptive_number_ratio)
n_masks = min(n_masks, self.max_n_time_masks)
else:
n_masks = self.n_time_masks
if self.adaptive_size_ratio > 0:
T = self.adaptive_size_ratio * n_frames
else:
T = self.T
for i in range(n_masks):
t = int(self._rng.uniform(low=0, high=T))
t = min(t, int(n_frames * self.p))
t_0 = int(self._rng.uniform(low=0, high=n_frames - t))
xs[:, t_0:t_0 + t] = 0
assert t_0 <= t_0 + t
self._time_mask = (t_0, t_0 + t)
return xs
def transform_feature(self, xs: np.ndarray):
"""
Args:
xs (FloatTensor): `[F, T]`
Returns:
xs (FloatTensor): `[F, T]`
"""
# xs = self.time_warp(xs)
xs = self.mask_freq(xs)
xs = self.mask_time(xs)
return xs
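A quick standalone check of the new augmentor on a dummy (D, T) feature; the parameter values are illustrative, and the import path follows the spec_augment module added in this change:

import numpy as np
from deepspeech.frontend.augmentor.spec_augment import SpecAugmentor

rng = np.random.RandomState(0)
aug = SpecAugmentor(rng, F=10, T=50, n_freq_masks=2, n_time_masks=2, p=1.0, W=80)

xs = np.ones((80, 300), dtype='float32')   # (D=80 mel bins, T=300 frames)
xs = aug.transform_feature(xs)             # zeroes random freq rows / time columns

print("last freq mask:", aug.freq_mask)    # (f_0, f_0 + f) of the last frequency mask
print("last time mask:", aug.time_mask)    # (t_0, t_0 + t) of the last time mask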

@ -12,36 +12,72 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contain the speech perturbation augmentation model.""" """Contain the speech perturbation augmentation model."""
import numpy as np
from deepspeech.frontend.augmentor.base import AugmentorBase from deepspeech.frontend.augmentor.base import AugmentorBase
class SpeedPerturbAugmentor(AugmentorBase): class SpeedPerturbAugmentor(AugmentorBase):
"""Augmentation model for adding speed perturbation. """Augmentation model for adding speed perturbation."""
See reference paper here: def __init__(self, rng, min_speed_rate=0.9, max_speed_rate=1.1,
http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf num_rates=3):
"""speed perturbation.
:param rng: Random generator object.
:type rng: random.Random The speed perturbation in kaldi uses sox-speed instead of sox-tempo,
:param min_speed_rate: Lower bound of new speed rate to sample and should and sox-speed just resamples the input,
not be smaller than 0.9. i.e. both pitch and tempo are changed.
:type min_speed_rate: float
:param max_speed_rate: Upper bound of new speed rate to sample and should "Why use speed option instead of tempo -s in SoX for speed perturbation"
not be larger than 1.1. https://groups.google.com/forum/#!topic/kaldi-help/8OOG7eE4sZ8
:type max_speed_rate: float
""" Sox speed:
https://pysox.readthedocs.io/en/latest/api.html#sox.transform.Transformer
def __init__(self, rng, min_speed_rate, max_speed_rate):
See reference paper here:
http://www.danielpovey.com/files/2015_interspeech_augmentation.pdf
Espnet:
https://espnet.github.io/espnet/_modules/espnet/transform/perturb.html
Nemo:
https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/perturb.py#L92
Args:
rng (random.Random): Random generator object.
min_speed_rate (float): Lower bound of new speed rate to sample and should
not be smaller than 0.9.
max_speed_rate (float): Upper bound of new speed rate to sample and should
not be larger than 1.1.
num_rates (int, optional): Number of discrete rates to allow.
Can be a positive or negative integer. Defaults to 3.
If a positive integer greater than 0 is provided, the range of
speed rates will be discretized into `num_rates` values.
If a negative integer or 0 is provided, the full range of speed rates
will be sampled uniformly.
Note: If a positive integer is provided and the resultant discretized
range of rates contains the value '1.0', then those samples with rate=1.0,
will not be augmented at all and simply skipped. This is to avoid unnecessary
augmentation and increased computation time. The effective augmentation chance
in such a case is `prob * ((num_rates - 1) / num_rates) * 100`%,
where `prob` is the global probability of a sample being augmented.
Raises:
ValueError: when min_speed_rate or max_speed_rate is out of the allowed range.
"""
if min_speed_rate < 0.9: if min_speed_rate < 0.9:
raise ValueError( raise ValueError(
"Sampling speed below 0.9 can cause unnatural effects") "Sampling speed below 0.9 can cause unnatural effects")
if max_speed_rate > 1.1: if max_speed_rate > 1.1:
raise ValueError( raise ValueError(
"Sampling speed above 1.1 can cause unnatural effects") "Sampling speed above 1.1 can cause unnatural effects")
self._min_speed_rate = min_speed_rate self._min_rate = min_speed_rate
self._max_speed_rate = max_speed_rate self._max_rate = max_speed_rate
self._rng = rng self._rng = rng
self._num_rates = num_rates
if num_rates > 0:
self._rates = np.linspace(
self._min_rate, self._max_rate, self._num_rates, endpoint=True)
def transform_audio(self, audio_segment): def transform_audio(self, audio_segment):
"""Sample a new speed rate from the given range and """Sample a new speed rate from the given range and
@ -52,6 +88,13 @@ class SpeedPerturbAugmentor(AugmentorBase):
:param audio_segment: Audio segment to add effects to. :param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegment|SpeechSegment :type audio_segment: AudioSegment|SpeechSegment
""" """
sampled_speed = self._rng.uniform(self._min_speed_rate, if self._num_rates < 0:
self._max_speed_rate) speed_rate = self._rng.uniform(self._min_rate, self._max_rate)
audio_segment.change_speed(sampled_speed) else:
speed_rate = self._rng.choice(self._rates)
# Skip perturbation in case of identity speed rate
if speed_rate == 1.0:
return
audio_segment.change_speed(speed_rate)
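With settings in the style of the defaults (0.9, 1.1, num_rates=3), the discretized grid contains 1.0, which is skipped, so the effective augmentation chance drops as the docstring describes. A small sketch of that arithmetic (illustrative values):

import numpy as np

rates = np.linspace(0.9, 1.1, 3, endpoint=True)
print(rates)                                   # [0.9 1.  1.1]; rate 1.0 is skipped

prob = 1.0                                     # global augmentation probability from the config
effective = prob * (len(rates) - 1) / len(rates)
print(f"effective augmentation chance: {effective:.2%}")   # 66.67%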

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the volume perturb augmentation model.""" """Contains the volume perturb augmentation model."""
from deepspeech.frontend.augmentor.base import AugmentorBase from deepspeech.frontend.augmentor.base import AugmentorBase

@ -12,12 +12,10 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the audio featurizer class.""" """Contains the audio featurizer class."""
import numpy as np import numpy as np
from deepspeech.frontend.utility import read_manifest
from deepspeech.frontend.audio import AudioSegment
from python_speech_features import mfcc
from python_speech_features import delta from python_speech_features import delta
from python_speech_features import logfbank
from python_speech_features import mfcc
class AudioFeaturizer(object): class AudioFeaturizer(object):
@ -49,15 +47,22 @@ class AudioFeaturizer(object):
""" """
def __init__(self, def __init__(self,
specgram_type='linear', specgram_type: str='linear',
feat_dim: int=None,
delta_delta: bool=False,
stride_ms=10.0, stride_ms=10.0,
window_ms=20.0, window_ms=20.0,
n_fft=None, n_fft=None,
max_freq=None, max_freq=None,
target_sample_rate=16000, target_sample_rate=16000,
use_dB_normalization=True, use_dB_normalization=True,
target_dB=-20): target_dB=-20,
dither=1.0):
self._specgram_type = specgram_type self._specgram_type = specgram_type
# mfcc and fbank using `feat_dim`
self._feat_dim = feat_dim
# mfcc and fbank using `delta-delta`
self._delta_delta = delta_delta
self._stride_ms = stride_ms self._stride_ms = stride_ms
self._window_ms = window_ms self._window_ms = window_ms
self._max_freq = max_freq self._max_freq = max_freq
@ -65,6 +70,7 @@ class AudioFeaturizer(object):
self._use_dB_normalization = use_dB_normalization self._use_dB_normalization = use_dB_normalization
self._target_dB = target_dB self._target_dB = target_dB
self._fft_point = n_fft self._fft_point = n_fft
self._dither = dither
def featurize(self, def featurize(self,
audio_segment, audio_segment,
@ -97,8 +103,11 @@ class AudioFeaturizer(object):
if self._use_dB_normalization: if self._use_dB_normalization:
audio_segment.normalize(target_db=self._target_dB) audio_segment.normalize(target_db=self._target_dB)
# extract spectrogram # extract spectrogram
return self._compute_specgram(audio_segment.samples, return self._compute_specgram(audio_segment)
audio_segment.sample_rate)
@property
def stride_ms(self):
return self._stride_ms
@property @property
def feature_size(self): def feature_size(self):
@ -109,22 +118,51 @@ class AudioFeaturizer(object):
feat_dim = int(fft_point * (self._target_sample_rate / 1000) / 2 + feat_dim = int(fft_point * (self._target_sample_rate / 1000) / 2 +
1) 1)
elif self._specgram_type == 'mfcc': elif self._specgram_type == 'mfcc':
# mfcc,delta, delta-delta # mfcc, delta, delta-delta
feat_dim = int(13 * 3) feat_dim = int(self._feat_dim *
3) if self._delta_delta else int(self._feat_dim)
elif self._specgram_type == 'fbank':
# fbank, delta, delta-delta
feat_dim = int(self._feat_dim *
3) if self._delta_delta else int(self._feat_dim)
else: else:
raise ValueError("Unknown specgram_type %s. " raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type) "Supported values: linear." % self._specgram_type)
return feat_dim return feat_dim
def _compute_specgram(self, samples, sample_rate): def _compute_specgram(self, audio_segment):
"""Extract various audio features.""" """Extract various audio features."""
sample_rate = audio_segment.sample_rate
if self._specgram_type == 'linear': if self._specgram_type == 'linear':
samples = audio_segment.samples
return self._compute_linear_specgram( return self._compute_linear_specgram(
samples, sample_rate, self._stride_ms, self._window_ms, samples,
self._max_freq) sample_rate,
stride_ms=self._stride_ms,
window_ms=self._window_ms,
max_freq=self._max_freq)
elif self._specgram_type == 'mfcc': elif self._specgram_type == 'mfcc':
return self._compute_mfcc(samples, sample_rate, self._stride_ms, samples = audio_segment.to('int16')
self._window_ms, self._max_freq) return self._compute_mfcc(
samples,
sample_rate,
feat_dim=self._feat_dim,
stride_ms=self._stride_ms,
window_ms=self._window_ms,
max_freq=self._max_freq,
dither=self._dither,
delta_delta=self._delta_delta)
elif self._specgram_type == 'fbank':
samples = audio_segment.to('int16')
return self._compute_fbank(
samples,
sample_rate,
feat_dim=self._feat_dim,
stride_ms=self._stride_ms,
window_ms=self._window_ms,
max_freq=self._max_freq,
dither=self._dither,
delta_delta=self._delta_delta)
else: else:
raise ValueError("Unknown specgram_type %s. " raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type) "Supported values: linear." % self._specgram_type)
@ -179,13 +217,55 @@ class AudioFeaturizer(object):
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0]) freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
return fft, freqs return fft, freqs
def _concat_delta_delta(self, feat):
"""append delat, delta-delta feature.
Args:
feat (np.ndarray): (D, T)
Returns:
np.ndarray: feat with delta-delta, (3*D, T)
"""
feat = np.transpose(feat)
# Deltas
d_feat = delta(feat, 2)
# Deltas-Deltas
dd_feat = delta(d_feat, 2)
# transpose
feat = np.transpose(feat)
d_feat = np.transpose(d_feat)
dd_feat = np.transpose(dd_feat)
# concat above three features
concat_feat = np.concatenate((feat, d_feat, dd_feat))
return concat_feat
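The helper above stacks static, delta and delta-delta features along the feature axis; a shape check with python_speech_features (illustrative 13-dim input, not part of the change) is:

import numpy as np
from python_speech_features import delta

feat = np.random.randn(13, 120)           # (D, T) static features
d = delta(feat.T, 2).T                     # deltas (the library expects (T, D))
dd = delta(delta(feat.T, 2), 2).T          # delta-deltas, computed from the deltas
stacked = np.concatenate((feat, d, dd))    # -> (3 * D, T)
print(stacked.shape)                       # (39, 120)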
def _compute_mfcc(self, def _compute_mfcc(self,
samples, samples,
sample_rate, sample_rate,
feat_dim=13,
stride_ms=10.0, stride_ms=10.0,
window_ms=20.0, window_ms=25.0,
max_freq=None): max_freq=None,
"""Compute mfcc from samples.""" dither=1.0,
delta_delta=True):
"""Compute mfcc from samples.
Args:
samples (np.ndarray, np.int16): the audio signal from which to compute features.
sample_rate (float): the sample rate of the signal we are working with, in Hz.
feat_dim (int): the number of cepstrum to return, default 13.
stride_ms (float, optional): stride length in ms. Defaults to 10.0.
window_ms (float, optional): window length in ms. Defaults to 25.0.
max_freq (float, optional): highest band edge of mel filters. In Hz, default is samplerate/2. Defaults to None.
delta_delta (bool, optional): Whether to append delta and delta-delta features. Defaults to True.
Raises:
ValueError: max_freq > samplerate/2
ValueError: stride_ms > window_ms
Returns:
np.ndarray: mfcc feature, (D, T).
"""
if max_freq is None: if max_freq is None:
max_freq = sample_rate / 2 max_freq = sample_rate / 2
if max_freq > sample_rate / 2: if max_freq > sample_rate / 2:
@ -195,22 +275,79 @@ class AudioFeaturizer(object):
raise ValueError("Stride size must not be greater than " raise ValueError("Stride size must not be greater than "
"window size.") "window size.")
# compute the 13 cepstral coefficients, and the first one is replaced # compute the 13 cepstral coefficients, and the first one is replaced
# by log(frame energy) # by log(frame energy), (T, D)
mfcc_feat = mfcc( mfcc_feat = mfcc(
signal=samples, signal=samples,
samplerate=sample_rate, samplerate=sample_rate,
winlen=0.001 * window_ms, winlen=0.001 * window_ms,
winstep=0.001 * stride_ms, winstep=0.001 * stride_ms,
highfreq=max_freq) numcep=feat_dim,
# Deltas nfilt=23,
d_mfcc_feat = delta(mfcc_feat, 2) nfft=512,
# Deltas-Deltas lowfreq=20,
dd_mfcc_feat = delta(d_mfcc_feat, 2) highfreq=max_freq,
# transpose dither=dither,
remove_dc_offset=True,
preemph=0.97,
ceplifter=22,
useEnergy=True,
winfunc='povey')
mfcc_feat = np.transpose(mfcc_feat) mfcc_feat = np.transpose(mfcc_feat)
d_mfcc_feat = np.transpose(d_mfcc_feat) if delta_delta:
dd_mfcc_feat = np.transpose(dd_mfcc_feat) mfcc_feat = self._concat_delta_delta(mfcc_feat)
# concat above three features return mfcc_feat
concat_mfcc_feat = np.concatenate(
(mfcc_feat, d_mfcc_feat, dd_mfcc_feat)) def _compute_fbank(self,
return concat_mfcc_feat samples,
sample_rate,
feat_dim=40,
stride_ms=10.0,
window_ms=25.0,
max_freq=None,
dither=1.0,
delta_delta=False):
"""Compute logfbank from samples.
Args:
samples (np.ndarray, np.int16): the audio signal from which to compute features. Should be an N*1 array
sample_rate (float): the sample rate of the signal we are working with, in Hz.
feat_dim (int): the number of mel filterbank bins to return, default 40.
stride_ms (float, optional): stride length in ms. Defaults to 10.0.
window_ms (float, optional): window length in ms. Defaults to 25.0.
max_freq (float, optional): highest band edge of mel filters. In Hz, default is samplerate/2. Defaults to None.
delta_delta (bool, optional): Whether with delta delta. Defaults to False.
Raises:
ValueError: max_freq > samplerate/2
ValueError: stride_ms > window_ms
Returns:
np.ndarray: fbank feature, (D, T).
"""
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq must not be greater than half of "
"sample rate.")
if stride_ms > window_ms:
raise ValueError("Stride size must not be greater than "
"window size.")
# (T, D)
fbank_feat = logfbank(
signal=samples,
samplerate=sample_rate,
winlen=0.001 * window_ms,
winstep=0.001 * stride_ms,
nfilt=feat_dim,
nfft=512,
lowfreq=20,
highfreq=max_freq,
dither=dither,
remove_dc_offset=True,
preemph=0.97,
wintype='povey')
fbank_feat = np.transpose(fbank_feat)
if delta_delta:
fbank_feat = self._concat_delta_delta(fbank_feat)
return fbank_feat
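A hypothetical usage of the refactored featurizer with fbank features; the wav path is a placeholder and the parameter values are illustrative rather than taken from a config in this change:

from deepspeech.frontend.audio import AudioSegment
from deepspeech.frontend.featurizer.audio_featurizer import AudioFeaturizer

featurizer = AudioFeaturizer(
    specgram_type='fbank',
    feat_dim=80,
    delta_delta=False,
    stride_ms=10.0,
    window_ms=25.0,
    dither=0.1)

audio = AudioSegment.from_file('sample.wav')   # placeholder path
feat = featurizer.featurize(audio)             # (D, T) fbank matrix
print(feat.shape, featurizer.feature_size, featurizer.stride_ms)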

@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the speech featurizer class.""" """Contains the speech featurizer class."""
from deepspeech.frontend.featurizer.audio_featurizer import AudioFeaturizer from deepspeech.frontend.featurizer.audio_featurizer import AudioFeaturizer
from deepspeech.frontend.featurizer.text_featurizer import TextFeaturizer from deepspeech.frontend.featurizer.text_featurizer import TextFeaturizer
@ -52,25 +51,34 @@ class SpeechFeaturizer(object):
""" """
def __init__(self, def __init__(self,
unit_type,
vocab_filepath, vocab_filepath,
spm_model_prefix=None,
specgram_type='linear', specgram_type='linear',
feat_dim=None,
delta_delta=False,
stride_ms=10.0, stride_ms=10.0,
window_ms=20.0, window_ms=20.0,
n_fft=None, n_fft=None,
max_freq=None, max_freq=None,
target_sample_rate=16000, target_sample_rate=16000,
use_dB_normalization=True, use_dB_normalization=True,
target_dB=-20): target_dB=-20,
dither=1.0):
self._audio_featurizer = AudioFeaturizer( self._audio_featurizer = AudioFeaturizer(
specgram_type=specgram_type, specgram_type=specgram_type,
feat_dim=feat_dim,
delta_delta=delta_delta,
stride_ms=stride_ms, stride_ms=stride_ms,
window_ms=window_ms, window_ms=window_ms,
n_fft=n_fft, n_fft=n_fft,
max_freq=max_freq, max_freq=max_freq,
target_sample_rate=target_sample_rate, target_sample_rate=target_sample_rate,
use_dB_normalization=use_dB_normalization, use_dB_normalization=use_dB_normalization,
target_dB=target_dB) target_dB=target_dB,
self._text_featurizer = TextFeaturizer(vocab_filepath) dither=dither)
self._text_featurizer = TextFeaturizer(unit_type, vocab_filepath,
spm_model_prefix)
def featurize(self, speech_segment, keep_transcription_text): def featurize(self, speech_segment, keep_transcription_text):
"""Extract features for speech segment. """Extract features for speech segment.
@ -79,24 +87,29 @@ class SpeechFeaturizer(object):
2. For transcript parts, keep the original text or convert text string 2. For transcript parts, keep the original text or convert text string
to a list of token indices in char-level. to a list of token indices in char-level.
:param audio_segment: Speech segment to extract features from. Args:
:type audio_segment: SpeechSegment speech_segment (SpeechSegment): Speech segment to extract features from.
:return: A tuple of 1) spectrogram audio feature in 2darray, 2) list of keep_transcription_text (bool): True, keep transcript text, False, token ids
char-level token indices.
:rtype: tuple Returns:
tuple: 1) spectrogram audio feature in 2darray, 2) list of token indices.
""" """
audio_feature = self._audio_featurizer.featurize(speech_segment) spec_feature = self._audio_featurizer.featurize(speech_segment)
if keep_transcription_text: if keep_transcription_text:
return audio_feature, speech_segment.transcript return spec_feature, speech_segment.transcript
text_ids = self._text_featurizer.featurize(speech_segment.transcript) if speech_segment.has_token:
return audio_feature, text_ids text_ids = speech_segment.token_ids
else:
text_ids = self._text_featurizer.featurize(
speech_segment.transcript)
return spec_feature, text_ids
@property @property
def vocab_size(self): def vocab_size(self):
"""Return the vocabulary size. """Return the vocabulary size.
:return: Vocabulary size. Returns:
:rtype: int int: Vocabulary size.
""" """
return self._text_featurizer.vocab_size return self._text_featurizer.vocab_size
@ -104,16 +117,43 @@ class SpeechFeaturizer(object):
def vocab_list(self): def vocab_list(self):
"""Return the vocabulary in list. """Return the vocabulary in list.
:return: Vocabulary in list. Returns:
:rtype: list List[str]:
""" """
return self._text_featurizer.vocab_list return self._text_featurizer.vocab_list
@property
def vocab_dict(self):
"""Return the vocabulary in dict.
Returns:
Dict[str, int]:
"""
return self._text_featurizer.vocab_dict
@property @property
def feature_size(self): def feature_size(self):
"""Return the audio feature size. """Return the audio feature size.
:return: audio feature size. Returns:
:rtype: int int: audio feature size.
"""
return self._audio_featurizer.feature_size
@property
def stride_ms(self):
"""time length in `ms` unit per frame
Returns:
float: time(ms)/frame
"""
return self._audio_featurizer.stride_ms
@property
def text_feature(self):
"""Return the text feature object.
Returns:
TextFeaturizer: object.
""" """
return self._audio_featurizer.feature_size return self._text_featurizer

@ -12,44 +12,91 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the text featurizer class.""" """Contains the text featurizer class."""
import sentencepiece as spm
import os from deepspeech.frontend.utility import EOS
import codecs from deepspeech.frontend.utility import UNK
class TextFeaturizer(object): class TextFeaturizer(object):
"""Text featurizer, for processing or extracting features from text. def __init__(self, unit_type, vocab_filepath, spm_model_prefix=None):
"""Text featurizer, for processing or extracting features from text.
Currently, it only supports char-level tokenizing and conversion into Currently, it supports char/word/sentence-piece level tokenizing and conversion into
a list of token indices. Note that the token indexing order follows the a list of token indices. Note that the token indexing order follows the
given vocabulary file. given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices Args:
conversion. unit_type (str): unit type, e.g. char, word, spm
:type specgram_type: str vocab_filepath (str): Filepath to load vocabulary for token indices conversion.
""" spm_model_prefix (str, optional): spm model prefix. Defaults to None.
"""
assert unit_type in ('char', 'spm', 'word')
self.unit_type = unit_type
self.unk = UNK
if vocab_filepath:
self._vocab_dict, self._id2token, self._vocab_list = self._load_vocabulary_from_file(
vocab_filepath)
self.unk_id = self._vocab_list.index(self.unk)
self.eos_id = self._vocab_list.index(EOS)
if unit_type == 'spm':
spm_model = spm_model_prefix + '.model'
self.sp = spm.SentencePieceProcessor()
self.sp.Load(spm_model)
def tokenize(self, text):
if self.unit_type == 'char':
tokens = self.char_tokenize(text)
elif self.unit_type == 'word':
tokens = self.word_tokenize(text)
else: # spm
tokens = self.spm_tokenize(text)
return tokens
def __init__(self, vocab_filepath): def detokenize(self, tokens):
self.unk = '<unk>' if self.unit_type == 'char':
self._vocab_dict, self._vocab_list = self._load_vocabulary_from_file( text = self.char_detokenize(tokens)
vocab_filepath) elif self.unit_type == 'word':
text = self.word_detokenize(tokens)
else: # spm
text = self.spm_detokenize(tokens)
return text
def featurize(self, text): def featurize(self, text):
"""Convert text string to a list of token indices in char-level.Note """Convert text string to a list of token indices.
that the token indexing order follows the given vocabulary file.
:param text: Text to process. Args:
:type text: str text (str): Text to process.
:return: List of char-level token indices.
:rtype: list Returns:
List[int]: List of token indices.
""" """
tokens = self._char_tokenize(text) tokens = self.tokenize(text)
ids = [] ids = []
for token in tokens: for token in tokens:
token = token if token in self._vocab_dict else self.unk token = token if token in self._vocab_dict else self.unk
ids.append(self._vocab_dict[token]) ids.append(self._vocab_dict[token])
return ids return ids
def defeaturize(self, idxs):
"""Convert a list of token indices to text string,
ignore index after eos_id.
Args:
idxs (List[int]): List of token indices.
Returns:
str: Text to process.
"""
tokens = []
for idx in idxs:
if idx == self.eos_id:
break
tokens.append(self._id2token[idx])
text = self.detokenize(tokens)
return text
@property @property
def vocab_size(self): def vocab_size(self):
"""Return the vocabulary size. """Return the vocabulary size.
@ -63,21 +110,110 @@ class TextFeaturizer(object):
def vocab_list(self): def vocab_list(self):
"""Return the vocabulary in list. """Return the vocabulary in list.
:return: Vocabulary in list. Returns:
:rtype: list List[str]: tokens.
""" """
return self._vocab_list return self._vocab_list
def _char_tokenize(self, text): @property
"""Character tokenizer.""" def vocab_dict(self):
"""Return the vocabulary in dict.
Returns:
Dict[str, int]: token str -> int
"""
return self._vocab_dict
def char_tokenize(self, text):
"""Character tokenizer.
Args:
text (str): text string.
Returns:
List[str]: tokens.
"""
return list(text.strip()) return list(text.strip())
def char_detokenize(self, tokens):
"""Character detokenizer.
Args:
tokens (List[str]): tokens.
Returns:
str: text string.
"""
return "".join(tokens)
def word_tokenize(self, text):
"""Word tokenizer, separate by <space>."""
return text.strip().split()
def word_detokenize(self, tokens):
"""Word detokenizer, separate by <space>."""
return " ".join(tokens)
def spm_tokenize(self, text):
"""spm tokenize.
Args:
text (str): text string.
Returns:
List[str]: sentence pieces str code
"""
stats = {"num_empty": 0, "num_filtered": 0}
def valid(line):
return True
def encode(l):
return self.sp.EncodeAsPieces(l)
def encode_line(line):
line = line.strip()
if len(line) > 0:
line = encode(line)
if valid(line):
return line
else:
stats["num_filtered"] += 1
else:
stats["num_empty"] += 1
return None
enc_line = encode_line(text)
return enc_line
def spm_detokenize(self, tokens, input_format='piece'):
"""spm detokenize.
Args:
tokens (List[str]): sentence piece tokens.
Returns:
str: text
"""
if input_format == "piece":
def decode(l):
return "".join(self.sp.DecodePieces(l))
elif input_format == "id":
def decode(l):
return "".join(self.sp.DecodeIds(l))
return decode(tokens)
def _load_vocabulary_from_file(self, vocab_filepath): def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file.""" """Load vocabulary from file."""
vocab_lines = [] vocab_lines = []
with codecs.open(vocab_filepath, 'r', 'utf-8') as file: with open(vocab_filepath, 'r', encoding='utf-8') as file:
vocab_lines.extend(file.readlines()) vocab_lines.extend(file.readlines())
vocab_list = [line[:-1] for line in vocab_lines] vocab_list = [line[:-1] for line in vocab_lines]
vocab_dict = dict( id2token = dict(
[(token, id) for (id, token) in enumerate(vocab_list)]) [(idx, token) for (idx, token) in enumerate(vocab_list)])
return vocab_dict, vocab_list token2id = dict(
[(token, idx) for (idx, token) in enumerate(vocab_list)])
return token2id, id2token, vocab_list
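A round-trip sketch for the char-level path: build a tiny vocab file on the fly (contents invented for the example; real vocabs come from the build_vocab step) and check that featurize() and defeaturize() invert each other:

import tempfile
from deepspeech.frontend.featurizer.text_featurizer import TextFeaturizer

tokens = ['<unk>', 'h', 'e', 'l', 'o', ' ', 'w', 'r', 'd', '<sos/eos>']
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(tokens) + '\n')   # one token per line, as the loader expects
    vocab_path = f.name

featurizer = TextFeaturizer(unit_type='char', vocab_filepath=vocab_path)
ids = featurizer.featurize('hello world')
print(ids)                           # [1, 2, 3, 3, 4, 5, 6, 4, 7, 3, 8]
print(featurizer.defeaturize(ids))   # 'hello world'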

@ -12,11 +12,68 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains feature normalizers.""" """Contains feature normalizers."""
import json
import numpy as np import numpy as np
import random import paddle
from deepspeech.frontend.utility import read_manifest from paddle.io import DataLoader
from paddle.io import Dataset
from deepspeech.frontend.audio import AudioSegment from deepspeech.frontend.audio import AudioSegment
from deepspeech.frontend.utility import load_cmvn
from deepspeech.frontend.utility import read_manifest
from deepspeech.utils.log import Log
__all__ = ["FeatureNormalizer"]
logger = Log(__name__).getlog()
# https://github.com/PaddlePaddle/Paddle/pull/31481
class CollateFunc(object):
def __init__(self, feature_func):
self.feature_func = feature_func
def __call__(self, batch):
mean_stat = None
var_stat = None
number = 0
for item in batch:
audioseg = AudioSegment.from_file(item['feat'])
feat = self.feature_func(audioseg) #(D, T)
sums = np.sum(feat, axis=1)
if mean_stat is None:
mean_stat = sums
else:
mean_stat += sums
square_sums = np.sum(np.square(feat), axis=1)
if var_stat is None:
var_stat = square_sums
else:
var_stat += square_sums
number += feat.shape[1]
return number, mean_stat, var_stat
class AudioDataset(Dataset):
def __init__(self, manifest_path, num_samples=-1, rng=None, random_seed=0):
self._rng = rng if rng else np.random.RandomState(random_seed)
manifest = read_manifest(manifest_path)
if num_samples == -1:
sampled_manifest = manifest
else:
sampled_manifest = self._rng.choice(
manifest, num_samples, replace=False)
self.items = sampled_manifest
def __len__(self):
return len(self.items)
def __getitem__(self, idx):
return self.items[idx]
class FeatureNormalizer(object): class FeatureNormalizer(object):
@ -47,27 +104,35 @@ class FeatureNormalizer(object):
manifest_path=None, manifest_path=None,
featurize_func=None, featurize_func=None,
num_samples=500, num_samples=500,
num_workers=0,
random_seed=0): random_seed=0):
if not mean_std_filepath: if not mean_std_filepath:
if not (manifest_path and featurize_func): if not (manifest_path and featurize_func):
raise ValueError("If mean_std_filepath is None, meanifest_path " raise ValueError("If mean_std_filepath is None, meanifest_path "
"and featurize_func should not be None.") "and featurize_func should not be None.")
self._rng = random.Random(random_seed) self._rng = np.random.RandomState(random_seed)
self._compute_mean_std(manifest_path, featurize_func, num_samples) self._compute_mean_std(manifest_path, featurize_func, num_samples,
num_workers)
else: else:
self._read_mean_std_from_file(mean_std_filepath) self._read_mean_std_from_file(mean_std_filepath)
def apply(self, features, eps=1e-14): def apply(self, features):
"""Normalize features to be of zero mean and unit stddev. """Normalize features to be of zero mean and unit stddev.
:param features: Input features to be normalized. :param features: Input features to be normalized.
:type features: ndarray :type features: ndarray, shape (D, T)
:param eps: added to stddev to provide numerical stability. :param eps: added to stddev to provide numerical stability.
:type eps: float :type eps: float
:return: Normalized features. :return: Normalized features.
:rtype: ndarray :rtype: ndarray
""" """
return (features - self._mean) / (self._std + eps) return (features - self._mean) * self._istd
def _read_mean_std_from_file(self, filepath, eps=1e-20):
"""Load mean and std from file."""
mean, istd = load_cmvn(filepath, filetype='json')
self._mean = np.expand_dims(mean, axis=-1)
self._istd = np.expand_dims(istd, axis=-1)
def write_to_file(self, filepath): def write_to_file(self, filepath):
"""Write the mean and stddev to the file. """Write the mean and stddev to the file.
@ -75,23 +140,52 @@ class FeatureNormalizer(object):
:param filepath: File to write mean and stddev. :param filepath: File to write mean and stddev.
:type filepath: str :type filepath: str
""" """
np.savez(filepath, mean=self._mean, std=self._std) with open(filepath, 'w') as fout:
fout.write(json.dumps(self.cmvn_info))
def _read_mean_std_from_file(self, filepath):
"""Load mean and std from file."""
npzfile = np.load(filepath)
self._mean = npzfile["mean"]
self._std = npzfile["std"]
def _compute_mean_std(self, manifest_path, featurize_func, num_samples): def _compute_mean_std(self,
manifest_path,
featurize_func,
num_samples,
num_workers,
batch_size=64,
eps=1e-20):
"""Compute mean and std from randomly sampled instances.""" """Compute mean and std from randomly sampled instances."""
manifest = read_manifest(manifest_path) paddle.set_device('cpu')
sampled_manifest = self._rng.sample(manifest, num_samples)
features = [] collate_func = CollateFunc(featurize_func)
for instance in sampled_manifest: dataset = AudioDataset(manifest_path, num_samples, self._rng)
features.append( data_loader = DataLoader(
featurize_func( dataset,
AudioSegment.from_file(instance["audio_filepath"]))) batch_size=batch_size,
features = np.hstack(features) shuffle=False,
self._mean = np.mean(features, axis=1).reshape([-1, 1]) num_workers=num_workers,
self._std = np.std(features, axis=1).reshape([-1, 1]) collate_fn=collate_func)
with paddle.no_grad():
all_mean_stat = None
all_var_stat = None
all_number = 0
wav_number = 0
for i, batch in enumerate(data_loader):
number, mean_stat, var_stat = batch
if i == 0:
all_mean_stat = mean_stat
all_var_stat = var_stat
else:
all_mean_stat += mean_stat
all_var_stat += var_stat
all_number += number
wav_number += batch_size
if wav_number % 1000 == 0:
logger.info('process {} wavs,{} frames'.format(wav_number,
all_number))
self.cmvn_info = {
'mean_stat': list(all_mean_stat.tolist()),
'var_stat': list(all_var_stat.tolist()),
'frame_num': all_number,
}
return self.cmvn_info
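The statistics accumulated above turn into the mean/istd used by apply() as follows; the numbers are toy values in the same json layout that write_to_file() produces:

import numpy as np

cmvn_info = {
    'mean_stat': [1200.0, -300.0],   # per-dim sums over all frames
    'var_stat': [100000.0, 5000.0],  # per-dim sums of squares
    'frame_num': 100,
}
count = cmvn_info['frame_num']
mean = np.array(cmvn_info['mean_stat']) / count
var = np.array(cmvn_info['var_stat']) / count - mean ** 2
var = np.maximum(var, 1.0e-20)       # same floor as _load_json_cmvn below
istd = 1.0 / np.sqrt(var)

feat = np.random.randn(2, 50)        # (D, T) feature
normalized = (feat - mean[:, None]) * istd[:, None]   # what apply() computes
print(normalized.shape)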

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains the speech segment class.""" """Contains the speech segment class."""
import numpy as np import numpy as np
from deepspeech.frontend.audio import AudioSegment from deepspeech.frontend.audio import AudioSegment
@ -24,7 +24,12 @@ class SpeechSegment(AudioSegment):
AudioSegment (AudioSegment): Audio Segment AudioSegment (AudioSegment): Audio Segment
""" """
def __init__(self, samples, sample_rate, transcript): def __init__(self,
samples,
sample_rate,
transcript,
tokens=None,
token_ids=None):
"""Speech segment abstraction, a subclass of AudioSegment, """Speech segment abstraction, a subclass of AudioSegment,
with an additional transcript. with an additional transcript.
@ -32,9 +37,14 @@ class SpeechSegment(AudioSegment):
samples (ndarray.float32): Audio samples [num_samples x num_channels]. samples (ndarray.float32): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate. sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech. transcript (str): Transcript text for the speech.
tokens (List[str], optional): Transcript tokens for the speech.
token_ids (List[int], optional): Transcript token ids for the speech.
""" """
AudioSegment.__init__(self, samples, sample_rate) AudioSegment.__init__(self, samples, sample_rate)
self._transcript = transcript self._transcript = transcript
# must init `tokens` with `token_ids` at the same time
self._tokens = tokens
self._token_ids = token_ids
def __eq__(self, other): def __eq__(self, other):
"""Return whether two objects are equal. """Return whether two objects are equal.
@ -46,6 +56,11 @@ class SpeechSegment(AudioSegment):
return False return False
if self._transcript != other._transcript: if self._transcript != other._transcript:
return False return False
if self.has_token and other.has_token:
if self._tokens != other._tokens:
return False
if self._token_ids != other._token_ids:
return False
return True return True
def __ne__(self, other): def __ne__(self, other):
@ -53,33 +68,39 @@ class SpeechSegment(AudioSegment):
return not self.__eq__(other) return not self.__eq__(other)
@classmethod @classmethod
def from_file(cls, filepath, transcript): def from_file(cls, filepath, transcript, tokens=None, token_ids=None):
"""Create speech segment from audio file and corresponding transcript. """Create speech segment from audio file and corresponding transcript.
:param filepath: Filepath or file object to audio file. Args:
:type filepath: str|file filepath (str|file): Filepath or file object to audio file.
:param transcript: Transcript text for the speech. transcript (str): Transcript text for the speech.
:type transript: str tokens (List[str], optional): text tokens. Defaults to None.
:return: Speech segment instance. token_ids (List[int], optional): text token ids. Defaults to None.
:rtype: SpeechSegment
Returns:
SpeechSegment: Speech segment instance.
""" """
audio = AudioSegment.from_file(filepath) audio = AudioSegment.from_file(filepath)
return cls(audio.samples, audio.sample_rate, transcript) return cls(audio.samples, audio.sample_rate, transcript, tokens,
token_ids)
@classmethod @classmethod
def from_bytes(cls, bytes, transcript): def from_bytes(cls, bytes, transcript, tokens=None, token_ids=None):
"""Create speech segment from a byte string and corresponding """Create speech segment from a byte string and corresponding
transcript.
Args:
:param bytes: Byte string containing audio samples. bytes (bytes): Byte string containing audio samples.
:type bytes: str transcript (str): Transcript text for the speech.
:param transcript: Transcript text for the speech. tokens (List[str], optional): text tokens. Defaults to None.
:type transript: str token_ids (List[int], optional): text token ids. Defaults to None.
:return: Speech segment instance.
:rtype: Speech Segment Returns:
SpeechSegment: Speech segment instance.
""" """
audio = AudioSegment.from_bytes(bytes) audio = AudioSegment.from_bytes(bytes)
return cls(audio.samples, audio.sample_rate, transcript) return cls(audio.samples, audio.sample_rate, transcript, tokens,
token_ids)
@classmethod @classmethod
def concatenate(cls, *segments): def concatenate(cls, *segments):
@ -98,6 +119,8 @@ class SpeechSegment(AudioSegment):
raise ValueError("No speech segments are given to concatenate.") raise ValueError("No speech segments are given to concatenate.")
sample_rate = segments[0]._sample_rate sample_rate = segments[0]._sample_rate
transcripts = "" transcripts = ""
tokens = []
token_ids = []
for seg in segments: for seg in segments:
if sample_rate != seg._sample_rate: if sample_rate != seg._sample_rate:
raise ValueError("Can't concatenate segments with " raise ValueError("Can't concatenate segments with "
@ -106,11 +129,20 @@ class SpeechSegment(AudioSegment):
raise TypeError("Only speech segments of the same type " raise TypeError("Only speech segments of the same type "
"instance can be concatenated.") "instance can be concatenated.")
transcripts += seg._transcript transcripts += seg._transcript
if seg.has_token:
tokens += seg._tokens
token_ids += seg._token_ids
samples = np.concatenate([seg.samples for seg in segments]) samples = np.concatenate([seg.samples for seg in segments])
return cls(samples, sample_rate, transcripts) return cls(samples, sample_rate, transcripts, tokens, token_ids)
@classmethod @classmethod
def slice_from_file(cls, filepath, transcript, start=None, end=None): def slice_from_file(cls,
filepath,
transcript,
tokens=None,
token_ids=None,
start=None,
end=None):
"""Loads a small section of an speech without having to load """Loads a small section of an speech without having to load
the entire file into the memory which can be incredibly wasteful. the entire file into the memory which can be incredibly wasteful.
@ -132,28 +164,54 @@ class SpeechSegment(AudioSegment):
:rtype: SpeechSegment :rtype: SpeechSegment
""" """
audio = AudioSegment.slice_from_file(filepath, start, end) audio = AudioSegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript) return cls(audio.samples, audio.sample_rate, transcript, tokens,
token_ids)
@classmethod @classmethod
def make_silence(cls, duration, sample_rate): def make_silence(cls, duration, sample_rate):
"""Creates a silent speech segment of the given duration and """Creates a silent speech segment of the given duration and
sample rate, transcript will be an empty string. sample rate, transcript will be an empty string.
:param duration: Length of silence in seconds. Args:
:type duration: float duration (float): Length of silence in seconds.
:param sample_rate: Sample rate. sample_rate (float): Sample rate.
:type sample_rate: float
:return: Silence of the given duration. Returns:
:rtype: SpeechSegment SpeechSegment: Silence of the given duration.
""" """
audio = AudioSegment.make_silence(duration, sample_rate) audio = AudioSegment.make_silence(duration, sample_rate)
return cls(audio.samples, audio.sample_rate, "") return cls(audio.samples, audio.sample_rate, "")
@property
def has_token(self):
if self._tokens and self._token_ids:
return True
return False
@property @property
def transcript(self): def transcript(self):
"""Return the transcript text. """Return the transcript text.
:return: Transcript text for the speech. Returns:
:rtype: str str: Transcript text for the speech.
""" """
return self._transcript return self._transcript
@property
def tokens(self):
"""Return the transcript text tokens.
Returns:
List[str]: text tokens.
"""
return self._tokens
@property
def token_ids(self):
"""Return the transcript text token ids.
Returns:
List[int]: text token ids.
"""
return self._token_ids
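A hypothetical construction showing the new token fields; the module path and all values are assumptions for illustration only:

import numpy as np
from deepspeech.frontend.speech import SpeechSegment   # assumed module path

samples = np.zeros(16000, dtype='float32')             # 1 s of silence at 16 kHz
seg = SpeechSegment(samples, 16000, transcript="hello world",
                    tokens=list("hello world"),
                    token_ids=[1, 2, 3, 3, 4, 5, 6, 4, 7, 3, 8])  # made-up ids
print(seg.has_token, seg.transcript, seg.token_ids)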

@ -12,41 +12,248 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Contains data helper functions.""" """Contains data helper functions."""
import json
import codecs import codecs
import os import json
import tarfile import math
import sys
import time
from threading import Thread import numpy as np
from multiprocessing import Process, Manager, Value
from paddle.dataset.common import md5file from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): __all__ = [
"load_cmvn", "read_manifest", "rms_to_db", "rms_to_dbfs", "max_dbfs",
"mean_dbfs", "gain_db_to_ratio", "normalize_audio", "SOS", "EOS", "UNK",
"BLANK"
]
IGNORE_ID = -1
SOS = "<sos/eos>"
EOS = SOS
UNK = "<unk>"
BLANK = "<blank>"
def read_manifest(
manifest_path,
max_input_len=float('inf'),
min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0, ):
"""Load and parse manifest file. """Load and parse manifest file.
Instances with durations outside [min_duration, max_duration] will be Args:
filtered out. manifest_path (str): Manifest file to load and parse.
max_input_len (float, optional): maximum input seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to float('inf').
min_input_len (float, optional): minimum input seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to 0.0.
max_output_len (float, optional): maximum output seq length, in modeling units. Defaults to float('inf').
min_output_len (float, optional): minimum output seq length, in modeling units. Defaults to 0.0.
max_output_input_ratio (float, optional): maximum output seq length / input seq length ratio. Defaults to float('inf').
min_output_input_ratio (float, optional): minimum output seq length / input seq length ratio. Defaults to 0.0.
Raises:
IOError: If failed to parse the manifest.
:param manifest_path: Manifest file to load and parse. Returns:
:type manifest_path: str List[dict]: Manifest parsing results.
:param max_duration: Maximal duration in seconds for instance filter.
:type max_duration: float
:param min_duration: Minimal duration in seconds for instance filter.
:type min_duration: float
:return: Manifest parsing results. List of dict.
:rtype: list
:raises IOError: If failed to parse the manifest.
""" """
manifest = [] manifest = []
for json_line in codecs.open(manifest_path, 'r', 'utf-8'): for json_line in codecs.open(manifest_path, 'r', 'utf-8'):
try: try:
json_data = json.loads(json_line) json_data = json.loads(json_line)
except Exception as e: except Exception as e:
raise IOError("Error reading manifest: %s" % str(e)) raise IOError("Error reading manifest: %s" % str(e))
if (json_data["duration"] <= max_duration and
json_data["duration"] >= min_duration): feat_len = json_data["feat_shape"][
0] if 'feat_shape' in json_data else 1.0
token_len = json_data["token_shape"][
0] if 'token_shape' in json_data else 1.0
conditions = [
feat_len >= min_input_len,
feat_len <= max_input_len,
token_len >= min_output_len,
token_len <= max_output_len,
token_len / feat_len >= min_output_input_ratio,
token_len / feat_len <= max_output_input_ratio,
]
if all(conditions):
manifest.append(json_data) manifest.append(json_data)
return manifest return manifest
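For reference, a manifest line in the shape this filter expects (values invented; only the keys referenced in this file are shown, other fields of real manifests are omitted):

import json

line = json.dumps({
    "feat": "data/aishell/wav/train/S0002/BAC009S0002W0122.wav",  # placeholder path
    "feat_shape": [418, 80],   # frames x dim for features, or duration for raw wav
    "token_shape": [12],       # number of modeling units in the transcript
})

entry = json.loads(line)
feat_len, token_len = entry["feat_shape"][0], entry["token_shape"][0]
print(feat_len, token_len, token_len / feat_len)   # the ratio used for filtering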
def rms_to_db(rms: float):
"""Root Mean Square to dB.
Args:
rms ([float]): root mean square
Returns:
float: dB
"""
return 20.0 * math.log10(max(1e-16, rms))
def rms_to_dbfs(rms: float):
"""Root Mean Square to dBFS.
https://fireattack.wordpress.com/2017/02/06/replaygain-loudness-normalization-and-applications/
Audio is treated as a mix of sine waves, so a full-scale (1.0 amplitude) sine wave has an RMS of 0.7071, i.e. -3.0103 dB.
dB = dBFS + 3.0103
dBFS = dB - 3.0103
e.g. 0 dB = -3.0103 dBFS
Args:
rms ([float]): root mean square
Returns:
float: dBFS
"""
return rms_to_db(rms) - 3.0103
def max_dbfs(sample_data: np.ndarray):
"""Peak dBFS based on the maximum energy sample.
Args:
sample_data ([np.ndarray]): float array, [-1, 1].
Returns:
float: dBFS
"""
# Peak dBFS based on the maximum energy sample. Will prevent overdrive if used for normalization.
return rms_to_dbfs(max(abs(np.min(sample_data)), abs(np.max(sample_data))))
def mean_dbfs(sample_data):
"""Peak dBFS based on the RMS energy.
Args:
sample_data ([np.ndarray]): float array, [-1, 1].
Returns:
float: dBFS
"""
return rms_to_dbfs(
math.sqrt(np.mean(np.square(sample_data, dtype=np.float64))))
def gain_db_to_ratio(gain_db: float):
"""dB to ratio
Args:
gain_db (float): gain in dB
Returns:
float: scale in amp
"""
return math.pow(10.0, gain_db / 20.0)
def normalize_audio(sample_data: np.ndarray, dbfs: float=-3.0103):
"""Nomalize audio to dBFS.
Args:
sample_data (np.ndarray): input wave samples, [-1, 1].
dbfs (float, optional): target dBFS. Defaults to -3.0103.
Returns:
np.ndarray: normalized wave
"""
return np.maximum(
np.minimum(sample_data * gain_db_to_ratio(dbfs - max_dbfs(sample_data)),
1.0), -1.0)
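A worked example of the dB helpers above (values rounded; assumes they are imported from deepspeech.frontend.utility):

import math
import numpy as np
from deepspeech.frontend.utility import max_dbfs, normalize_audio, rms_to_dbfs

t = np.arange(0, 1.0, 1.0 / 16000)
sine = 0.1 * np.sin(2 * np.pi * 440.0 * t)   # peak amplitude 0.1

rms = math.sqrt(np.mean(np.square(sine)))
print(rms_to_dbfs(rms))        # ~ -26.0 dBFS (mean level)
print(max_dbfs(sine))          # ~ -23.0 dBFS (peak level)

louder = normalize_audio(sine, dbfs=-3.0103)
print(max_dbfs(louder))        # ~ -3.01 dBFS after normalization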
def _load_json_cmvn(json_cmvn_file):
""" Load the json format cmvn stats file and calculate cmvn
Args:
json_cmvn_file: cmvn stats file in json format
Returns:
a numpy array of [means, istd], where istd is the inverse standard deviation
"""
with open(json_cmvn_file) as f:
cmvn_stats = json.load(f)
means = cmvn_stats['mean_stat']
variance = cmvn_stats['var_stat']
count = cmvn_stats['frame_num']
for i in range(len(means)):
means[i] /= count
variance[i] = variance[i] / count - means[i] * means[i]
if variance[i] < 1.0e-20:
variance[i] = 1.0e-20
variance[i] = 1.0 / math.sqrt(variance[i])
cmvn = np.array([means, variance])
return cmvn
def _load_kaldi_cmvn(kaldi_cmvn_file):
""" Load the kaldi format cmvn stats file and calculate cmvn
Args:
kaldi_cmvn_file: kaldi text style global cmvn file, which
is generated by:
compute-cmvn-stats --binary=false scp:feats.scp global_cmvn
Returns:
a numpy array of [means, istd], where istd is the inverse standard deviation
"""
means = []
variance = []
with open(kaldi_cmvn_file, 'r') as fid:
# kaldi binary files start with '\0B'
if fid.read(2) == '\0B':
logger.error('kaldi cmvn binary file is not supported, please '
'recompute it by: compute-cmvn-stats --binary=false '
' scp:feats.scp global_cmvn')
sys.exit(1)
fid.seek(0)
arr = fid.read().split()
assert (arr[0] == '[')
assert (arr[-2] == '0')
assert (arr[-1] == ']')
feat_dim = int((len(arr) - 2 - 2) / 2)
for i in range(1, feat_dim + 1):
means.append(float(arr[i]))
count = float(arr[feat_dim + 1])
for i in range(feat_dim + 2, 2 * feat_dim + 2):
variance.append(float(arr[i]))
for i in range(len(means)):
means[i] /= count
variance[i] = variance[i] / count - means[i] * means[i]
if variance[i] < 1.0e-20:
variance[i] = 1.0e-20
variance[i] = 1.0 / math.sqrt(variance[i])
cmvn = np.array([means, variance])
return cmvn
def load_cmvn(cmvn_file: str, filetype: str):
"""load cmvn from file.
Args:
cmvn_file (str): cmvn path.
filetype (str): file type, one of 'npz', 'json' or 'kaldi'.
Raises:
ValueError: if the file type is not supported.
Returns:
Tuple[np.ndarray, np.ndarray]: mean, istd
"""
filetype = filetype.lower()
assert filetype in ['npz', 'json', 'kaldi'], filetype
if filetype == "json":
cmvn = _load_json_cmvn(cmvn_file)
elif filetype == "kaldi":
cmvn = _load_kaldi_cmvn(cmvn_file)
else:
raise ValueError(f"cmvn file type no support: {filetype}")
return cmvn[0], cmvn[1]
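
For context, load_cmvn returns a (mean, istd) pair; a minimal sketch of how such stats are typically applied to a [T, D] feature matrix follows (the actual normalization layer is deepspeech.modules.cmvn.GlobalCMVN; the file path below is hypothetical):

import numpy as np

def apply_cmvn(feat: np.ndarray, mean: np.ndarray, istd: np.ndarray) -> np.ndarray:
    """Mean-variance normalize a [T, D] feature matrix with precomputed stats."""
    return (feat - mean) * istd

# mean, istd = load_cmvn("data/mean_std.json", filetype="json")  # hypothetical path
# feat = apply_cmvn(feat, mean, istd)
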

@ -11,25 +11,33 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import numpy as np
from paddle.io import DataLoader from paddle.io import DataLoader
from deepspeech.io.collator import SpeechCollator from deepspeech.io.collator import SpeechCollator
from deepspeech.io.sampler import SortagradDistributedBatchSampler
from deepspeech.io.sampler import SortagradBatchSampler
from deepspeech.io.dataset import ManifestDataset from deepspeech.io.dataset import ManifestDataset
from deepspeech.io.sampler import SortagradBatchSampler
from deepspeech.io.sampler import SortagradDistributedBatchSampler
def create_dataloader(manifest_path, def create_dataloader(manifest_path,
unit_type,
vocab_filepath, vocab_filepath,
mean_std_filepath, mean_std_filepath,
spm_model_prefix,
augmentation_config='{}', augmentation_config='{}',
max_duration=float('inf'), max_input_len=float('inf'),
min_duration=0.0, min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0,
stride_ms=10.0, stride_ms=10.0,
window_ms=20.0, window_ms=20.0,
max_freq=None, max_freq=None,
specgram_type='linear', specgram_type='linear',
feat_dim=None,
delta_delta=False,
use_dB_normalization=True, use_dB_normalization=True,
random_seed=0, random_seed=0,
keep_transcription_text=False, keep_transcription_text=False,
@ -41,16 +49,24 @@ def create_dataloader(manifest_path,
dist=False): dist=False):
dataset = ManifestDataset( dataset = ManifestDataset(
manifest_path, manifest_path=manifest_path,
vocab_filepath, unit_type=unit_type,
mean_std_filepath, vocab_filepath=vocab_filepath,
mean_std_filepath=mean_std_filepath,
spm_model_prefix=spm_model_prefix,
augmentation_config=augmentation_config, augmentation_config=augmentation_config,
max_duration=max_duration, max_input_len=max_input_len,
min_duration=min_duration, min_input_len=min_input_len,
max_output_len=max_output_len,
min_output_len=min_output_len,
max_output_input_ratio=max_output_input_ratio,
min_output_input_ratio=min_output_input_ratio,
stride_ms=stride_ms, stride_ms=stride_ms,
window_ms=window_ms, window_ms=window_ms,
max_freq=max_freq, max_freq=max_freq,
specgram_type=specgram_type, specgram_type=specgram_type,
feat_dim=feat_dim,
delta_delta=delta_delta,
use_dB_normalization=use_dB_normalization, use_dB_normalization=use_dB_normalization,
random_seed=random_seed, random_seed=random_seed,
keep_transcription_text=keep_transcription_text) keep_transcription_text=keep_transcription_text)
@ -74,7 +90,10 @@ def create_dataloader(manifest_path,
sortagrad=is_training, sortagrad=is_training,
shuffle_method=shuffle_method) shuffle_method=shuffle_method)
def padding_batch(batch, padding_to=-1, flatten=False, is_training=True): def padding_batch(batch,
padding_to=-1,
flatten=False,
keep_transcription_text=True):
""" """
Padding audio features with zeros to make them have the same shape (or Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one bach. a user-defined shape) within one bach.
@ -107,10 +126,10 @@ def create_dataloader(manifest_path,
audio_lens.append(audio.shape[1]) audio_lens.append(audio.shape[1])
padded_text = np.zeros([max_text_length]) padded_text = np.zeros([max_text_length])
if is_training: if keep_transcription_text:
padded_text[:len(text)] = text #ids
else:
padded_text[:len(text)] = [ord(t) for t in text] # string padded_text[:len(text)] = [ord(t) for t in text] # string
else:
padded_text[:len(text)] = text # ids
texts.append(padded_text) texts.append(padded_text)
text_lens.append(len(text)) text_lens.append(len(text))
@ -118,11 +137,13 @@ def create_dataloader(manifest_path,
audio_lens = np.array(audio_lens).astype('int64') audio_lens = np.array(audio_lens).astype('int64')
texts = np.array(texts).astype('int32') texts = np.array(texts).astype('int32')
text_lens = np.array(text_lens).astype('int64') text_lens = np.array(text_lens).astype('int64')
return padded_audios, texts, audio_lens, text_lens return padded_audios, audio_lens, texts, text_lens
# collate_fn=functools.partial(padding_batch, keep_transcription_text=keep_transcription_text),
collate_fn = SpeechCollator(keep_transcription_text=keep_transcription_text)
loader = DataLoader( loader = DataLoader(
dataset, dataset,
batch_sampler=batch_sampler, batch_sampler=batch_sampler,
collate_fn=partial(padding_batch, is_training=is_training), collate_fn=collate_fn,
num_workers=num_workers) num_workers=num_workers)
return loader return loader

@ -11,63 +11,68 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import logging
import numpy as np import numpy as np
from collections import namedtuple
logger = logging.getLogger(__name__) from deepspeech.frontend.utility import IGNORE_ID
from deepspeech.io.utility import pad_sequence
from deepspeech.utils.log import Log
__all__ = ["SpeechCollator"]
__all__ = [ logger = Log(__name__).getlog()
"SpeechCollator",
]
class SpeechCollator(): class SpeechCollator():
def __init__(self, padding_to=-1, is_training=True): def __init__(self, keep_transcription_text=True):
""" """
Padding audio features with zeros to make them have the same shape (or Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one bach. a user-defined shape) within one bach.
If ``padding_to`` is -1, the maximun shape in the batch will be used if ``keep_transcription_text`` is False, text is token ids else is raw string.
as the target shape for padding. Otherwise, `padding_to` will be the
target shape (only refers to the second axis).
""" """
self._padding_to = padding_to self._keep_transcription_text = keep_transcription_text
self._is_training = is_training
def __call__(self, batch): def __call__(self, batch):
new_batch = [] """batch examples
# get target shape
max_length = max([audio.shape[1] for audio, _ in batch]) Args:
if self._padding_to != -1: batch ([List]): batch is (audio, text)
if self._padding_to < max_length: audio (np.ndarray) shape (D, T)
raise ValueError("If padding_to is not -1, it should be larger " text (List[int] or str): shape (U,)
"than any instance's shape in the batch")
max_length = self._padding_to Returns:
max_text_length = max([len(text) for _, text in batch]) tuple(audio, text, audio_lens, text_lens): batched data.
# padding audio : (B, Tmax, D)
padded_audios = [] audio_lens: (B)
text : (B, Umax)
text_lens: (B)
"""
audios = []
audio_lens = [] audio_lens = []
texts, text_lens = [], [] texts = []
text_lens = []
for audio, text in batch: for audio, text in batch:
# audio # audio
padded_audio = np.zeros([audio.shape[0], max_length]) audios.append(audio.T) # [T, D]
padded_audio[:, :audio.shape[1]] = audio
padded_audios.append(padded_audio)
audio_lens.append(audio.shape[1]) audio_lens.append(audio.shape[1])
# text # text
padded_text = np.zeros([max_text_length]) # for training, text is token ids
if self._is_training: # else text is string, convert to unicode ord
padded_text[:len(text)] = text # token ids tokens = []
if self._keep_transcription_text:
assert isinstance(text, str), (type(text), text)
tokens = [ord(t) for t in text]
else: else:
padded_text[:len(text)] = [ord(t) tokens = text # token ids
for t in text] # string, unicode ord tokens = tokens if isinstance(tokens, np.ndarray) else np.array(
texts.append(padded_text) tokens, dtype=np.int64)
text_lens.append(len(text)) texts.append(tokens)
text_lens.append(tokens.shape[0])
padded_audios = np.array(padded_audios).astype('float32') padded_audios = pad_sequence(
audio_lens = np.array(audio_lens).astype('int64') audios, padding_value=0.0).astype(np.float32) #[B, T, D]
texts = np.array(texts).astype('int32') audio_lens = np.array(audio_lens).astype(np.int64)
text_lens = np.array(text_lens).astype('int64') padded_texts = pad_sequence(
return padded_audios, texts, audio_lens, text_lens texts, padding_value=IGNORE_ID).astype(np.int64)
text_lens = np.array(text_lens).astype(np.int64)
return padded_audios, audio_lens, padded_texts, text_lens

@ -11,44 +11,151 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import io
import math
import random
import tarfile import tarfile
import logging import time
import numpy as np
from collections import namedtuple from collections import namedtuple
from functools import partial from typing import Optional
import numpy as np
from paddle.io import Dataset from paddle.io import Dataset
from yacs.config import CfgNode
from deepspeech.frontend.utility import read_manifest
from deepspeech.frontend.augmentor.augmentation import AugmentationPipeline from deepspeech.frontend.augmentor.augmentation import AugmentationPipeline
from deepspeech.frontend.featurizer.speech_featurizer import SpeechFeaturizer from deepspeech.frontend.featurizer.speech_featurizer import SpeechFeaturizer
from deepspeech.frontend.speech import SpeechSegment
from deepspeech.frontend.normalizer import FeatureNormalizer from deepspeech.frontend.normalizer import FeatureNormalizer
from deepspeech.frontend.speech import SpeechSegment
logger = logging.getLogger(__name__) from deepspeech.frontend.utility import read_manifest
from deepspeech.utils.log import Log
__all__ = [ __all__ = [
"ManifestDataset", "ManifestDataset",
] ]
logger = Log(__name__).getlog()
# namedtuple needs to be global for pickle.
TarLocalData = namedtuple('TarLocalData', ['tar2info', 'tar2object'])
class ManifestDataset(Dataset): class ManifestDataset(Dataset):
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
default = CfgNode(
dict(
train_manifest="",
dev_manifest="",
test_manifest="",
manifest="",
unit_type="char",
vocab_filepath="",
spm_model_prefix="",
mean_std_filepath="",
augmentation_config="",
max_input_len=27.0,
min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0,
stride_ms=10.0, # ms
window_ms=20.0, # ms
n_fft=None, # fft points
max_freq=None, # None for samplerate/2
raw_wav=True, # use raw_wav or kaldi feature
specgram_type='linear', # 'linear', 'mfcc', 'fbank'
feat_dim=0, # 'mfcc', 'fbank'
delta_delta=False, # 'mfcc', 'fbank'
dither=1.0, # feature dither
target_sample_rate=16000, # target sample rate
use_dB_normalization=True,
target_dB=-20,
random_seed=0,
keep_transcription_text=False,
batch_size=32, # batch size
num_workers=0, # data loader workers
sortagrad=False, # sorted in first epoch when True
shuffle_method="batch_shuffle", # 'batch_shuffle', 'instance_shuffle'
))
if config is not None:
config.merge_from_other_cfg(default)
return default
@classmethod
def from_config(cls, config):
"""Build a ManifestDataset object from a config.
Args:
config (yacs.config.CfgNode): configs object.
Returns:
ManifestDataset: dataset object.
"""
assert 'manifest' in config.data
assert config.data.manifest
assert 'keep_transcription_text' in config.data
if isinstance(config.data.augmentation_config, (str, bytes)):
if config.data.augmentation_config:
aug_file = io.open(
config.data.augmentation_config, mode='r', encoding='utf8')
else:
aug_file = io.StringIO(initial_value='{}', newline='')
else:
aug_file = config.data.augmentation_config
assert isinstance(aug_file, io.StringIO)
dataset = cls(
manifest_path=config.data.manifest,
unit_type=config.data.unit_type,
vocab_filepath=config.data.vocab_filepath,
mean_std_filepath=config.data.mean_std_filepath,
spm_model_prefix=config.data.spm_model_prefix,
augmentation_config=aug_file.read(),
max_input_len=config.data.max_input_len,
min_input_len=config.data.min_input_len,
max_output_len=config.data.max_output_len,
min_output_len=config.data.min_output_len,
max_output_input_ratio=config.data.max_output_input_ratio,
min_output_input_ratio=config.data.min_output_input_ratio,
stride_ms=config.data.stride_ms,
window_ms=config.data.window_ms,
n_fft=config.data.n_fft,
max_freq=config.data.max_freq,
target_sample_rate=config.data.target_sample_rate,
specgram_type=config.data.specgram_type,
feat_dim=config.data.feat_dim,
delta_delta=config.data.delta_delta,
dither=config.data.dither,
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=config.data.keep_transcription_text)
return dataset
def __init__(self, def __init__(self,
manifest_path, manifest_path,
unit_type,
vocab_filepath, vocab_filepath,
mean_std_filepath, mean_std_filepath,
spm_model_prefix=None,
augmentation_config='{}', augmentation_config='{}',
max_duration=float('inf'), max_input_len=float('inf'),
min_duration=0.0, min_input_len=0.0,
max_output_len=float('inf'),
min_output_len=0.0,
max_output_input_ratio=float('inf'),
min_output_input_ratio=0.0,
stride_ms=10.0, stride_ms=10.0,
window_ms=20.0, window_ms=20.0,
n_fft=None, n_fft=None,
max_freq=None, max_freq=None,
target_sample_rate=16000, target_sample_rate=16000,
specgram_type='linear', specgram_type='linear',
feat_dim=None,
delta_delta=False,
dither=1.0,
use_dB_normalization=True, use_dB_normalization=True,
target_dB=-20, target_dB=-20,
random_seed=0, random_seed=0,
@ -57,52 +164,69 @@ class ManifestDataset(Dataset):
Args: Args:
manifest_path (str): manifest josn file path manifest_path (str): manifest josn file path
vocab_filepath (str): vocab file path unit_type(str): token unit type, e.g. char, word, spm
vocab_filepath (str): vocab file path.
mean_std_filepath (str): mean and std file path, which suffix is *.npy mean_std_filepath (str): mean and std file path, which suffix is *.npy
spm_model_prefix (str): spm model prefix, need if `unit_type` is spm.
augmentation_config (str, optional): augmentation json str. Defaults to '{}'. augmentation_config (str, optional): augmentation json str. Defaults to '{}'.
max_duration (float, optional): audio length in seconds must be less than this. Defaults to float('inf'). max_input_len ([type], optional): maximum input seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to float('inf').
min_duration (float, optional): audio length in seconds must be greater than this. Defaults to 0.0. min_input_len (float, optional): minimum input seq length, in seconds for raw wav, in frame numbers for feature data. Defaults to 0.0.
max_output_len (float, optional): maximum output seq length, in modeling units. Defaults to 500.0.
min_output_len (float, optional): minimum output seq length, in modeling units. Defaults to 0.0.
max_output_input_ratio (float, optional): maximum output seq length / input seq length ratio. Defaults to 10.0.
min_output_input_ratio (float, optional): minimum output seq length / input seq length ratio. Defaults to 0.05.
stride_ms (float, optional): stride size in ms. Defaults to 10.0. stride_ms (float, optional): stride size in ms. Defaults to 10.0.
window_ms (float, optional): window size in ms. Defaults to 20.0. window_ms (float, optional): window size in ms. Defaults to 20.0.
n_fft (int, optional): fft points for rfft. Defaults to None. n_fft (int, optional): fft points for rfft. Defaults to None.
max_freq (int, optional): max cut freq. Defaults to None. max_freq (int, optional): max cut freq. Defaults to None.
target_sample_rate (int, optional): target sample rate which used for training. Defaults to 16000. target_sample_rate (int, optional): target sample rate which used for training. Defaults to 16000.
specgram_type (str, optional): 'linear' or 'mfcc'. Defaults to 'linear'. specgram_type (str, optional): 'linear', 'mfcc' or 'fbank'. Defaults to 'linear'.
feat_dim (int, optional): audio feature dim, using by 'mfcc' or 'fbank'. Defaults to None.
delta_delta (bool, optional): audio feature with delta-delta, using by 'fbank' or 'mfcc'. Defaults to False.
use_dB_normalization (bool, optional): do dB normalization. Defaults to True. use_dB_normalization (bool, optional): do dB normalization. Defaults to True.
target_dB (int, optional): target dB. Defaults to -20. target_dB (int, optional): target dB. Defaults to -20.
random_seed (int, optional): for random generator. Defaults to 0. random_seed (int, optional): for random generator. Defaults to 0.
keep_transcription_text (bool, optional): True, when not in training mode, will not do tokenizer; Defaults to False. keep_transcription_text (bool, optional): True, when not in training mode, will not do tokenizer; Defaults to False.
""" """
super().__init__() super().__init__()
self._stride_ms = stride_ms
self._target_sample_rate = target_sample_rate
self._max_duration = max_duration self._normalizer = FeatureNormalizer(
self._min_duration = min_duration mean_std_filepath) if mean_std_filepath else None
self._normalizer = FeatureNormalizer(mean_std_filepath)
self._augmentation_pipeline = AugmentationPipeline( self._augmentation_pipeline = AugmentationPipeline(
augmentation_config=augmentation_config, random_seed=random_seed) augmentation_config=augmentation_config, random_seed=random_seed)
self._speech_featurizer = SpeechFeaturizer( self._speech_featurizer = SpeechFeaturizer(
unit_type=unit_type,
vocab_filepath=vocab_filepath, vocab_filepath=vocab_filepath,
spm_model_prefix=spm_model_prefix,
specgram_type=specgram_type, specgram_type=specgram_type,
feat_dim=feat_dim,
delta_delta=delta_delta,
stride_ms=stride_ms, stride_ms=stride_ms,
window_ms=window_ms, window_ms=window_ms,
n_fft=n_fft, n_fft=n_fft,
max_freq=max_freq, max_freq=max_freq,
target_sample_rate=target_sample_rate, target_sample_rate=target_sample_rate,
use_dB_normalization=use_dB_normalization, use_dB_normalization=use_dB_normalization,
target_dB=target_dB) target_dB=target_dB,
self._rng = random.Random(random_seed) dither=dither)
self._rng = np.random.RandomState(random_seed)
self._keep_transcription_text = keep_transcription_text self._keep_transcription_text = keep_transcription_text
# for caching tar files info # for caching tar files info
self._local_data = namedtuple('local_data', ['tar2info', 'tar2object']) self._local_data = TarLocalData(tar2info={}, tar2object={})
self._local_data.tar2info = {}
self._local_data.tar2object = {}
# read manifest # read manifest
self._manifest = read_manifest( self._manifest = read_manifest(
manifest_path=manifest_path, manifest_path=manifest_path,
max_duration=self._max_duration, max_input_len=max_input_len,
min_duration=self._min_duration) min_input_len=min_input_len,
self._manifest.sort(key=lambda x: x["duration"]) max_output_len=max_output_len,
min_output_len=min_output_len,
max_output_input_ratio=max_output_input_ratio,
min_output_input_ratio=min_output_input_ratio)
self._manifest.sort(key=lambda x: x["feat_shape"][0])
@property @property
def manifest(self): def manifest(self):
@ -110,26 +234,28 @@ class ManifestDataset(Dataset):
@property @property
def vocab_size(self): def vocab_size(self):
"""Return the vocabulary size.
:return: Vocabulary size.
:rtype: int
"""
return self._speech_featurizer.vocab_size return self._speech_featurizer.vocab_size
@property @property
def vocab_list(self): def vocab_list(self):
"""Return the vocabulary in list.
:return: Vocabulary in list.
:rtype: list
"""
return self._speech_featurizer.vocab_list return self._speech_featurizer.vocab_list
@property
def vocab_dict(self):
return self._speech_featurizer.vocab_dict
@property
def text_feature(self):
return self._speech_featurizer.text_feature
@property @property
def feature_size(self): def feature_size(self):
return self._speech_featurizer.feature_size return self._speech_featurizer.feature_size
@property
def stride_ms(self):
return self._speech_featurizer.stride_ms
def _parse_tar(self, file): def _parse_tar(self, file):
"""Parse a tar file to get a tarfile object """Parse a tar file to get a tarfile object
and a map containing tarinfoes and a map containing tarinfoes
@ -169,15 +295,34 @@ class ManifestDataset(Dataset):
where transcription part could be token ids or text. where transcription part could be token ids or text.
:rtype: tuple of (2darray, list) :rtype: tuple of (2darray, list)
""" """
start_time = time.time()
if isinstance(audio_file, str) and audio_file.startswith('tar:'): if isinstance(audio_file, str) and audio_file.startswith('tar:'):
speech_segment = SpeechSegment.from_file( speech_segment = SpeechSegment.from_file(
self._subfile_from_tar(audio_file), transcript) self._subfile_from_tar(audio_file), transcript)
else: else:
speech_segment = SpeechSegment.from_file(audio_file, transcript) speech_segment = SpeechSegment.from_file(audio_file, transcript)
load_wav_time = time.time() - start_time
#logger.debug(f"load wav time: {load_wav_time}")
# audio augment
start_time = time.time()
self._augmentation_pipeline.transform_audio(speech_segment) self._augmentation_pipeline.transform_audio(speech_segment)
audio_aug_time = time.time() - start_time
#logger.debug(f"audio augmentation time: {audio_aug_time}")
start_time = time.time()
specgram, transcript_part = self._speech_featurizer.featurize( specgram, transcript_part = self._speech_featurizer.featurize(
speech_segment, self._keep_transcription_text) speech_segment, self._keep_transcription_text)
specgram = self._normalizer.apply(specgram) if self._normalizer:
specgram = self._normalizer.apply(specgram)
feature_time = time.time() - start_time
#logger.debug(f"audio & test feature time: {feature_time}")
# specgram augment
start_time = time.time()
specgram = self._augmentation_pipeline.transform_feature(specgram)
feature_aug_time = time.time() - start_time
#logger.debug(f"audio feature augmentation time: {feature_aug_time}")
return specgram, transcript_part return specgram, transcript_part
def _instance_reader_creator(self, manifest): def _instance_reader_creator(self, manifest):
@ -191,7 +336,7 @@ class ManifestDataset(Dataset):
def reader(): def reader():
for instance in manifest: for instance in manifest:
inst = self.process_utterance(instance["audio_filepath"], inst = self.process_utterance(instance["feat"],
instance["text"]) instance["text"])
yield inst yield inst
@ -202,5 +347,4 @@ class ManifestDataset(Dataset):
def __getitem__(self, idx): def __getitem__(self, idx):
instance = self._manifest[idx] instance = self._manifest[idx]
return self.process_utterance(instance["audio_filepath"], return self.process_utterance(instance["feat"], instance["text"])
instance["text"])

@ -11,27 +11,22 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import math import math
import random
import tarfile
import logging
import numpy as np
from collections import namedtuple
from functools import partial
import paddle import numpy as np
from paddle import distributed as dist
from paddle.io import BatchSampler from paddle.io import BatchSampler
from paddle.io import DistributedBatchSampler from paddle.io import DistributedBatchSampler
from paddle import distributed as dist
logger = logging.getLogger(__name__) from deepspeech.utils.log import Log
__all__ = [ __all__ = [
"SortagradDistributedBatchSampler", "SortagradDistributedBatchSampler",
"SortagradBatchSampler", "SortagradBatchSampler",
] ]
logger = Log(__name__).getlog()
def _batch_shuffle(indices, batch_size, epoch, clipped=False): def _batch_shuffle(indices, batch_size, epoch, clipped=False):
"""Put similarly-sized instances into minibatches for better efficiency """Put similarly-sized instances into minibatches for better efficiency
@ -59,7 +54,7 @@ def _batch_shuffle(indices, batch_size, epoch, clipped=False):
batch_indices = list(zip(* [iter(indices[shift_len:])] * batch_size)) batch_indices = list(zip(* [iter(indices[shift_len:])] * batch_size))
rng.shuffle(batch_indices) rng.shuffle(batch_indices)
batch_indices = [item for batch in batch_indices for item in batch] batch_indices = [item for batch in batch_indices for item in batch]
assert (clipped == False) assert clipped is False
if not clipped: if not clipped:
res_len = len(indices) - shift_len - len(batch_indices) res_len = len(indices) - shift_len - len(batch_indices)
# when res_len is 0, will return whole list, len(List[-0:]) = len(List[:]) # when res_len is 0, will return whole list, len(List[-0:]) = len(List[:])
@ -161,7 +156,7 @@ class SortagradDistributedBatchSampler(DistributedBatchSampler):
for idx in _sample_iter: for idx in _sample_iter:
batch_indices.append(idx) batch_indices.append(idx)
if len(batch_indices) == self.batch_size: if len(batch_indices) == self.batch_size:
logger.info( logger.debug(
f"rank: {dist.get_rank()} batch index: {batch_indices} ") f"rank: {dist.get_rank()} batch index: {batch_indices} ")
yield batch_indices yield batch_indices
batch_indices = [] batch_indices = []
@ -195,13 +190,13 @@ class SortagradBatchSampler(BatchSampler):
self.dataset = dataset self.dataset = dataset
assert isinstance(batch_size, int) and batch_size > 0, \ assert isinstance(batch_size, int) and batch_size > 0, \
"batch_size should be a positive integer" "batch_size should be a positive integer"
self.batch_size = batch_size self.batch_size = batch_size
assert isinstance(shuffle, bool), \ assert isinstance(shuffle, bool), \
"shuffle should be a boolean value" "shuffle should be a boolean value"
self.shuffle = shuffle self.shuffle = shuffle
assert isinstance(drop_last, bool), \ assert isinstance(drop_last, bool), \
"drop_last should be a boolean number" "drop_last should be a boolean number"
self.drop_last = drop_last self.drop_last = drop_last
self.epoch = 0 self.epoch = 0
@ -241,7 +236,7 @@ class SortagradBatchSampler(BatchSampler):
for idx in _sample_iter: for idx in _sample_iter:
batch_indices.append(idx) batch_indices.append(idx)
if len(batch_indices) == self.batch_size: if len(batch_indices) == self.batch_size:
logger.info( logger.debug(
f"rank: {dist.get_rank()} batch index: {batch_indices} ") f"rank: {dist.get_rank()} batch index: {batch_indices} ")
yield batch_indices yield batch_indices
batch_indices = [] batch_indices = []

@ -0,0 +1,82 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
import numpy as np
from deepspeech.utils.log import Log
__all__ = ["pad_sequence"]
logger = Log(__name__).getlog()
def pad_sequence(sequences: List[np.ndarray],
batch_first: bool=True,
padding_value: float=0.0) -> np.ndarray:
r"""Pad a list of variable length Tensors with ``padding_value``
``pad_sequence`` stacks a list of Tensors along a new dimension,
and pads them to equal length. For example, if the input is list of
sequences with size ``L x *`` and if batch_first is False, and ``T x B x *``
otherwise.
`B` is batch size. It is equal to the number of elements in ``sequences``.
`T` is length of the longest sequence.
`L` is length of the sequence.
`*` is any number of trailing dimensions, including none.
Example:
>>> a = np.ones([25, 300])
>>> b = np.ones([22, 300])
>>> c = np.ones([15, 300])
>>> pad_sequence([a, b, c]).shape
(3, 25, 300)  # batch_first defaults to True; (25, 3, 300) with batch_first=False
Note:
This function returns a np.ndarray of size ``T x B x *`` or ``B x T x *``
where `T` is the length of the longest sequence. This function assumes
trailing dimensions and type of all the Tensors in sequences are same.
Args:
sequences (list[np.ndarray]): list of variable length sequences.
batch_first (bool, optional): output will be in ``B x T x *`` if True, or in
``T x B x *`` otherwise
padding_value (float, optional): value for padded elements. Default: 0.
Returns:
np.ndarray of size ``T x B x *`` if :attr:`batch_first` is ``False``.
np.ndarray of size ``B x T x *`` otherwise
"""
# assuming trailing dimensions and type of all the Tensors
# in sequences are same and fetching those from sequences[0]
max_size = sequences[0].shape
trailing_dims = max_size[1:]
max_len = max([s.shape[0] for s in sequences])
if batch_first:
out_dims = (len(sequences), max_len) + trailing_dims
else:
out_dims = (max_len, len(sequences)) + trailing_dims
out_tensor = np.full(out_dims, padding_value, dtype=sequences[0].dtype)
for i, tensor in enumerate(sequences):
length = tensor.shape[0]
# use index notation to prevent duplicate references to the tensor
if batch_first:
out_tensor[i, :length, ...] = tensor
else:
out_tensor[:length, i, ...] = tensor
return out_tensor
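
A small usage sketch (illustrative values, assuming pad_sequence above is in scope) showing the default batch_first=True layout, e.g. as the collator uses it for token ids:

import numpy as np

seqs = [np.array([5, 2, 9]), np.array([7, 1])]
padded = pad_sequence(seqs, padding_value=-1)  # -1 stands in for the ignore/pad label
print(padded)
# [[ 5  2  9]
#  [ 7  1 -1]]   shape (B, Tmax) = (2, 3)
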

@ -11,29 +11,21 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Deepspeech2 ASR Model"""
import math
import collections
import numpy as np
import logging
from typing import Optional from typing import Optional
from yacs.config import CfgNode
import paddle import paddle
from paddle import nn from paddle import nn
from paddle.nn import functional as F from yacs.config import CfgNode
from paddle.nn import initializer as I
from deepspeech.modules.mask import sequence_mask
from deepspeech.modules.activation import brelu
from deepspeech.modules.conv import ConvStack from deepspeech.modules.conv import ConvStack
from deepspeech.modules.rnn import RNNStack
from deepspeech.modules.ctc import CTCDecoder from deepspeech.modules.ctc import CTCDecoder
from deepspeech.modules.rnn import RNNStack
from deepspeech.utils import checkpoint from deepspeech.utils import checkpoint
from deepspeech.utils import layer_tools from deepspeech.utils import layer_tools
from deepspeech.utils.log import Log
logger = logging.getLogger(__name__) logger = Log(__name__).getlog()
__all__ = ['DeepSpeech2Model'] __all__ = ['DeepSpeech2Model']
@ -67,23 +59,19 @@ class CRNNEncoder(nn.Layer):
return self.rnn_size * 2 return self.rnn_size * 2
def forward(self, audio, audio_len): def forward(self, audio, audio_len):
"""
audio: shape [B, D, T]
text: shape [B, T]
audio_len: shape [B]
text_len: shape [B]
"""
"""Compute Encoder outputs """Compute Encoder outputs
Args: Args:
audio (Tensor): [B, D, T] audio (Tensor): [B, Tmax, D]
text (Tensor): [B, T] text (Tensor): [B, Umax]
audio_len (Tensor): [B] audio_len (Tensor): [B]
text_len (Tensor): [B] text_len (Tensor): [B]
Returns: Returns:
x (Tensor): encoder outputs, [B, T, D] x (Tensor): encoder outputs, [B, T, D]
x_lens (Tensor): encoder length, [B] x_lens (Tensor): encoder length, [B]
""" """
# [B, T, D] -> [B, D, T]
audio = audio.transpose([0, 2, 1])
# [B, D, T] -> [B, C=1, D, T] # [B, D, T] -> [B, C=1, D, T]
x = audio.unsqueeze(1) x = audio.unsqueeze(1)
x_lens = audio_len x_lens = audio_len
@ -166,26 +154,25 @@ class DeepSpeech2Model(nn.Layer):
assert (self.encoder.output_size == rnn_size * 2) assert (self.encoder.output_size == rnn_size * 2)
self.decoder = CTCDecoder( self.decoder = CTCDecoder(
odim=dict_size, # <blank> is in vocab
enc_n_units=self.encoder.output_size, enc_n_units=self.encoder.output_size,
odim=dict_size + 1, # <blank> is append after vocab blank_id=0, # first token is <blank>
blank_id=dict_size, # last token is <blank>
dropout_rate=0.0, dropout_rate=0.0,
reduction=True, # sum reduction=True, # sum
batch_average=True) # sum / batch_size batch_average=True) # sum / batch_size
def forward(self, audio, text, audio_len, text_len): def forward(self, audio, audio_len, text, text_len):
"""Compute Model loss """Compute Model loss
Args: Args:
audio (Tenosr): [B, D, T] audio (Tenosr): [B, T, D]
text (Tensor): [B, T]
audio_len (Tensor): [B] audio_len (Tensor): [B]
text (Tensor): [B, U]
text_len (Tensor): [B] text_len (Tensor): [B]
Returns: Returns:
loss (Tenosr): [1] loss (Tenosr): [1]
""" """
eouts, eouts_len = self.encoder(audio, audio_len) eouts, eouts_len = self.encoder(audio, audio_len)
loss = self.decoder(eouts, eouts_len, text, text_len) loss = self.decoder(eouts, eouts_len, text, text_len)
return loss return loss
@ -204,7 +191,7 @@ class DeepSpeech2Model(nn.Layer):
decoding_method=decoding_method) decoding_method=decoding_method)
eouts, eouts_len = self.encoder(audio, audio_len) eouts, eouts_len = self.encoder(audio, audio_len)
probs = self.decoder.probs(eouts) probs = self.decoder.softmax(eouts)
return self.decoder.decode_probs( return self.decoder.decode_probs(
probs.numpy(), eouts_len, vocab_list, decoding_method, probs.numpy(), eouts_len, vocab_list, decoding_method,
lang_model_path, beam_alpha, beam_beta, beam_size, cutoff_prob, lang_model_path, beam_alpha, beam_beta, beam_size, cutoff_prob,
@ -235,7 +222,9 @@ class DeepSpeech2Model(nn.Layer):
rnn_size=config.model.rnn_layer_size, rnn_size=config.model.rnn_layer_size,
use_gru=config.model.use_gru, use_gru=config.model.use_gru,
share_rnn_weights=config.model.share_rnn_weights) share_rnn_weights=config.model.share_rnn_weights)
checkpoint.load_parameters(model, checkpoint_path=checkpoint_path) infos = checkpoint.load_parameters(
model, checkpoint_path=checkpoint_path)
logger.info(f"checkpoint info: {infos}")
layer_tools.summary(model) layer_tools.summary(model)
return model return model
@ -262,12 +251,12 @@ class DeepSpeech2InferModel(DeepSpeech2Model):
"""export model function """export model function
Args: Args:
audio (Tensor): [B, D, T] audio (Tensor): [B, T, D]
audio_len (Tensor): [B] audio_len (Tensor): [B]
Returns: Returns:
probs: probs after softmax probs: probs after softmax
""" """
eouts, eouts_len = self.encoder(audio, audio_len) eouts, eouts_len = self.encoder(audio, audio_len)
probs = self.decoder.probs(eouts) probs = self.decoder.softmax(eouts)
return probs return probs

@ -0,0 +1,928 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""U2 ASR Model
Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition
(https://arxiv.org/pdf/2012.05481.pdf)
"""
import sys
import time
from collections import defaultdict
from typing import Dict
from typing import List
from typing import Optional
from typing import Tuple
import paddle
from paddle import jit
from paddle import nn
from yacs.config import CfgNode
from deepspeech.frontend.utility import IGNORE_ID
from deepspeech.frontend.utility import load_cmvn
from deepspeech.modules.cmvn import GlobalCMVN
from deepspeech.modules.ctc import CTCDecoder
from deepspeech.modules.decoder import TransformerDecoder
from deepspeech.modules.encoder import ConformerEncoder
from deepspeech.modules.encoder import TransformerEncoder
from deepspeech.modules.loss import LabelSmoothingLoss
from deepspeech.modules.mask import make_pad_mask
from deepspeech.modules.mask import mask_finished_preds
from deepspeech.modules.mask import mask_finished_scores
from deepspeech.modules.mask import subsequent_mask
from deepspeech.utils import checkpoint
from deepspeech.utils import layer_tools
from deepspeech.utils.ctc_utils import remove_duplicates_and_blank
from deepspeech.utils.log import Log
from deepspeech.utils.tensor_utils import add_sos_eos
from deepspeech.utils.tensor_utils import pad_sequence
from deepspeech.utils.tensor_utils import th_accuracy
from deepspeech.utils.utility import log_add
__all__ = ["U2Model", "U2InferModel"]
logger = Log(__name__).getlog()
class U2BaseModel(nn.Module):
"""CTC-Attention hybrid Encoder-Decoder model"""
@classmethod
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
# network architecture
default = CfgNode()
# allow add new item when merge_with_file
default.cmvn_file = ""
default.cmvn_file_type = "json"
default.input_dim = 0
default.output_dim = 0
# encoder related
default.encoder = 'transformer'
default.encoder_conf = CfgNode(
dict(
output_size=256, # dimension of attention
attention_heads=4,
linear_units=2048, # the number of units of position-wise feed forward
num_blocks=12, # the number of encoder blocks
dropout_rate=0.1,
positional_dropout_rate=0.1,
attention_dropout_rate=0.0,
input_layer='conv2d', # encoder input type, you can choose conv2d, conv2d6 and conv2d8
normalize_before=True,
# use_cnn_module=True,
# cnn_module_kernel=15,
# activation_type='swish',
# pos_enc_layer_type='rel_pos',
# selfattention_layer_type='rel_selfattn',
))
# decoder related
default.decoder = 'transformer'
default.decoder_conf = CfgNode(
dict(
attention_heads=4,
linear_units=2048,
num_blocks=6,
dropout_rate=0.1,
positional_dropout_rate=0.1,
self_attention_dropout_rate=0.0,
src_attention_dropout_rate=0.0, ))
# hybrid CTC/attention
default.model_conf = CfgNode(
dict(
ctc_weight=0.3,
lsm_weight=0.1, # label smoothing option
length_normalized_loss=False, ))
if config is not None:
config.merge_from_other_cfg(default)
return default
def __init__(self,
vocab_size: int,
encoder: TransformerEncoder,
decoder: TransformerDecoder,
ctc: CTCDecoder,
ctc_weight: float=0.5,
ignore_id: int=IGNORE_ID,
lsm_weight: float=0.0,
length_normalized_loss: bool=False):
assert 0.0 <= ctc_weight <= 1.0, ctc_weight
super().__init__()
# note that eos is the same as sos (equivalent ID)
self.sos = vocab_size - 1
self.eos = vocab_size - 1
self.vocab_size = vocab_size
self.ignore_id = ignore_id
self.ctc_weight = ctc_weight
self.encoder = encoder
self.decoder = decoder
self.ctc = ctc
self.criterion_att = LabelSmoothingLoss(
size=vocab_size,
padding_idx=ignore_id,
smoothing=lsm_weight,
normalize_length=length_normalized_loss, )
def forward(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
text: paddle.Tensor,
text_lengths: paddle.Tensor,
) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[
paddle.Tensor]]:
"""Frontend + Encoder + Decoder + Calc loss
Args:
speech: (Batch, Length, ...)
speech_lengths: (Batch, )
text: (Batch, Length)
text_lengths: (Batch,)
Returns:
total_loss, attention_loss, ctc_loss
"""
assert text_lengths.dim() == 1, text_lengths.shape
# Check that batch_size is unified
assert (speech.shape[0] == speech_lengths.shape[0] == text.shape[0] ==
text_lengths.shape[0]), (speech.shape, speech_lengths.shape,
text.shape, text_lengths.shape)
# 1. Encoder
start = time.time()
encoder_out, encoder_mask = self.encoder(speech, speech_lengths)
encoder_time = time.time() - start
#logger.debug(f"encoder time: {encoder_time}")
#TODO(Hui Zhang): sum not support bool type
#encoder_out_lens = encoder_mask.squeeze(1).sum(1) #[B, 1, T] -> [B]
encoder_out_lens = encoder_mask.squeeze(1).cast(paddle.int64).sum(
1) #[B, 1, T] -> [B]
# 2a. Attention-decoder branch
loss_att = None
if self.ctc_weight != 1.0:
start = time.time()
loss_att, acc_att = self._calc_att_loss(encoder_out, encoder_mask,
text, text_lengths)
decoder_time = time.time() - start
#logger.debug(f"decoder time: {decoder_time}")
# 2b. CTC branch
loss_ctc = None
if self.ctc_weight != 0.0:
start = time.time()
loss_ctc = self.ctc(encoder_out, encoder_out_lens, text,
text_lengths)
ctc_time = time.time() - start
#logger.debug(f"ctc time: {ctc_time}")
if loss_ctc is None:
loss = loss_att
elif loss_att is None:
loss = loss_ctc
else:
loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
return loss, loss_att, loss_ctc
def _calc_att_loss(
self,
encoder_out: paddle.Tensor,
encoder_mask: paddle.Tensor,
ys_pad: paddle.Tensor,
ys_pad_lens: paddle.Tensor, ) -> Tuple[paddle.Tensor, float]:
"""Calc attention loss.
Args:
encoder_out (paddle.Tensor): [B, Tmax, D]
encoder_mask (paddle.Tensor): [B, 1, Tmax]
ys_pad (paddle.Tensor): [B, Umax]
ys_pad_lens (paddle.Tensor): [B]
Returns:
Tuple[paddle.Tensor, float]: attention_loss, accuracy rate
"""
ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos,
self.ignore_id)
ys_in_lens = ys_pad_lens + 1
# 1. Forward decoder
decoder_out, _ = self.decoder(encoder_out, encoder_mask, ys_in_pad,
ys_in_lens)
# 2. Compute attention loss
loss_att = self.criterion_att(decoder_out, ys_out_pad)
acc_att = th_accuracy(
decoder_out.view(-1, self.vocab_size),
ys_out_pad,
ignore_label=self.ignore_id, )
return loss_att, acc_att
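
add_sos_eos is imported from deepspeech.utils.tensor_utils and does not appear in this diff; the sketch below is an assumption about its behaviour based on how it is used here (which is also why ys_in_lens = ys_pad_lens + 1 above), not the actual implementation, and the real helper additionally re-pads its outputs:

import numpy as np

IGNORE_ID = -1  # assumed padding label, mirroring deepspeech.frontend.utility

def add_sos_eos_sketch(ys_pad: np.ndarray, sos: int, eos: int, ignore_id: int):
    """Illustrative stand-in: prepend <sos> for decoder input, append <eos> for the target."""
    ys_in, ys_out = [], []
    for row in ys_pad:
        ys = row[row != ignore_id]  # strip padding
        ys_in.append(np.concatenate(([sos], ys)))
        ys_out.append(np.concatenate((ys, [eos])))
    return ys_in, ys_out

print(add_sos_eos_sketch(np.array([[4, 9, 7, IGNORE_ID]]), sos=10, eos=10, ignore_id=IGNORE_ID))
# ([array([10, 4, 9, 7])], [array([4, 9, 7, 10])])
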
def _forward_encoder(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False,
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Encoder pass.
Args:
speech (paddle.Tensor): [B, Tmax, D]
speech_lengths (paddle.Tensor): [B]
decoding_chunk_size (int, optional): chuck size. Defaults to -1.
num_decoding_left_chunks (int, optional): nums chunks. Defaults to -1.
simulate_streaming (bool, optional): streaming or not. Defaults to False.
Returns:
Tuple[paddle.Tensor, paddle.Tensor]:
encoder hiddens (B, Tmax, D),
encoder hiddens mask (B, 1, Tmax).
"""
# Let's assume B = batch_size
# 1. Encoder
if simulate_streaming and decoding_chunk_size > 0:
encoder_out, encoder_mask = self.encoder.forward_chunk_by_chunk(
speech,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks
) # (B, maxlen, encoder_dim)
else:
encoder_out, encoder_mask = self.encoder(
speech,
speech_lengths,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks
) # (B, maxlen, encoder_dim)
return encoder_out, encoder_mask
def recognize(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int=10,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False, ) -> paddle.Tensor:
""" Apply beam search on attention decoder
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
paddle.Tensor: decoding result, (batch, max_result_len)
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
device = speech.place
batch_size = speech.shape[0]
# Let's assume B = batch_size and N = beam_size
# 1. Encoder
encoder_out, encoder_mask = self._forward_encoder(
speech, speech_lengths, decoding_chunk_size,
num_decoding_left_chunks,
simulate_streaming) # (B, maxlen, encoder_dim)
maxlen = encoder_out.size(1)
encoder_dim = encoder_out.size(2)
running_size = batch_size * beam_size
encoder_out = encoder_out.unsqueeze(1).repeat(1, beam_size, 1, 1).view(
running_size, maxlen, encoder_dim) # (B*N, maxlen, encoder_dim)
encoder_mask = encoder_mask.unsqueeze(1).repeat(
1, beam_size, 1, 1).view(running_size, 1,
maxlen) # (B*N, 1, max_len)
hyps = paddle.ones(
[running_size, 1], dtype=paddle.long).fill_(self.sos) # (B*N, 1)
# log scale score
scores = paddle.to_tensor(
[0.0] + [-float('inf')] * (beam_size - 1), dtype=paddle.float)
scores = scores.to(device).repeat(batch_size).unsqueeze(1).to(
device) # (B*N, 1)
end_flag = paddle.zeros_like(scores, dtype=paddle.bool) # (B*N, 1)
cache: Optional[List[paddle.Tensor]] = None
# 2. Decoder forward step by step
for i in range(1, maxlen + 1):
# Stop if all batch and all beam produce eos
# TODO(Hui Zhang): if end_flag.sum() == running_size:
if end_flag.cast(paddle.int64).sum() == running_size:
break
# 2.1 Forward decoder step
hyps_mask = subsequent_mask(i).unsqueeze(0).repeat(
running_size, 1, 1).to(device) # (B*N, i, i)
# logp: (B*N, vocab)
logp, cache = self.decoder.forward_one_step(
encoder_out, encoder_mask, hyps, hyps_mask, cache)
# 2.2 First beam prune: select topk best prob at current time
top_k_logp, top_k_index = logp.topk(beam_size) # (B*N, N)
top_k_logp = mask_finished_scores(top_k_logp, end_flag)
top_k_index = mask_finished_preds(top_k_index, end_flag, self.eos)
# 2.3 Second beam prune: select topk score with history
scores = scores + top_k_logp # (B*N, N), broadcast add
scores = scores.view(batch_size, beam_size * beam_size) # (B, N*N)
scores, offset_k_index = scores.topk(k=beam_size) # (B, N)
scores = scores.view(-1, 1) # (B*N, 1)
# 2.4. Compute base index in top_k_index,
# regard top_k_index as (B*N*N),regard offset_k_index as (B*N),
# then find offset_k_index in top_k_index
base_k_index = paddle.arange(batch_size).view(-1, 1).repeat(
1, beam_size) # (B, N)
base_k_index = base_k_index * beam_size * beam_size
best_k_index = base_k_index.view(-1) + offset_k_index.view(
-1) # (B*N)
# 2.5 Update best hyps
best_k_pred = paddle.index_select(
top_k_index.view(-1), index=best_k_index, axis=0) # (B*N)
best_hyps_index = best_k_index // beam_size
last_best_k_hyps = paddle.index_select(
hyps, index=best_hyps_index, axis=0) # (B*N, i)
hyps = paddle.cat(
(last_best_k_hyps, best_k_pred.view(-1, 1)),
dim=1) # (B*N, i+1)
# 2.6 Update end flag
end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)
# TODO: length normalization
best_index = paddle.argmax(scores, axis=-1).long() # (B)
best_hyps_index = best_index + paddle.arange(
batch_size, dtype=paddle.long) * beam_size
best_hyps = paddle.index_select(hyps, index=best_hyps_index, axis=0)
best_hyps = best_hyps[:, 1:]
return best_hyps
def ctc_greedy_search(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False, ) -> List[List[int]]:
""" Apply CTC greedy search
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[List[int]]: best path result
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
batch_size = speech.shape[0]
# Let's assume B = batch_size
# encoder_out: (B, maxlen, encoder_dim)
# encoder_mask: (B, 1, Tmax)
encoder_out, encoder_mask = self._forward_encoder(
speech, speech_lengths, decoding_chunk_size,
num_decoding_left_chunks, simulate_streaming)
maxlen = encoder_out.size(1)
# (TODO Hui Zhang): bool no support reduce_sum
# encoder_out_lens = encoder_mask.squeeze(1).sum(1)
encoder_out_lens = encoder_mask.squeeze(1).astype(paddle.int).sum(1)
ctc_probs = self.ctc.log_softmax(encoder_out) # (B, maxlen, vocab_size)
topk_prob, topk_index = ctc_probs.topk(1, axis=2) # (B, maxlen, 1)
topk_index = topk_index.view(batch_size, maxlen) # (B, maxlen)
pad_mask = make_pad_mask(encoder_out_lens) # (B, maxlen)
topk_index = topk_index.masked_fill_(pad_mask, self.eos) # (B, maxlen)
hyps = [hyp.tolist() for hyp in topk_index]
hyps = [remove_duplicates_and_blank(hyp) for hyp in hyps]
return hyps
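
remove_duplicates_and_blank comes from deepspeech.utils.ctc_utils and is not shown in this hunk; a minimal sketch of the standard CTC collapse rule it is assumed to implement (merge repeated labels, then drop blanks):

from typing import List

def remove_duplicates_and_blank_sketch(hyp: List[int], blank_id: int=0) -> List[int]:
    """Collapse consecutive repeats, then remove blank tokens (standard CTC post-processing)."""
    out = []
    prev = None
    for token in hyp:
        if token != prev and token != blank_id:
            out.append(token)
        prev = token
    return out

print(remove_duplicates_and_blank_sketch([0, 3, 3, 0, 0, 5, 5, 5, 2]))  # [3, 5, 2]
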
def _ctc_prefix_beam_search(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False,
blank_id: int=0, ) -> Tuple[List[Tuple[int, float]], paddle.Tensor]:
""" CTC prefix beam search inner implementation
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[Tuple[int, float]]: nbest results, (N,1), (text, likelihood)
paddle.Tensor: encoder output, (1, max_len, encoder_dim),
it will be used for rescoring in attention rescoring mode
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
batch_size = speech.shape[0]
# For CTC prefix beam search, we only support batch_size=1
assert batch_size == 1
# Let's assume B = batch_size and N = beam_size
# 1. Encoder forward and get CTC score
encoder_out, encoder_mask = self._forward_encoder(
speech, speech_lengths, decoding_chunk_size,
num_decoding_left_chunks,
simulate_streaming) # (B, maxlen, encoder_dim)
maxlen = encoder_out.size(1)
ctc_probs = self.ctc.log_softmax(encoder_out) # (1, maxlen, vocab_size)
ctc_probs = ctc_probs.squeeze(0)
# cur_hyps: (prefix, (blank_ending_score, none_blank_ending_score))
cur_hyps = [(tuple(), (0.0, -float('inf')))]
# 2. CTC beam search step by step
for t in range(0, maxlen):
logp = ctc_probs[t] # (vocab_size,)
# key: prefix, value (pb, pnb), default value(-inf, -inf)
next_hyps = defaultdict(lambda: (-float('inf'), -float('inf')))
# 2.1 First beam prune: select topk best
top_k_logp, top_k_index = logp.topk(beam_size) # (beam_size,)
for s in top_k_index:
s = s.item()
ps = logp[s].item()
for prefix, (pb, pnb) in cur_hyps:
last = prefix[-1] if len(prefix) > 0 else None
if s == blank_id: # blank
n_pb, n_pnb = next_hyps[prefix]
n_pb = log_add([n_pb, pb + ps, pnb + ps])
next_hyps[prefix] = (n_pb, n_pnb)
elif s == last:
# Update *ss -> *s;
n_pb, n_pnb = next_hyps[prefix]
n_pnb = log_add([n_pnb, pnb + ps])
next_hyps[prefix] = (n_pb, n_pnb)
# Update *s-s -> *ss, - is for blank
n_prefix = prefix + (s, )
n_pb, n_pnb = next_hyps[n_prefix]
n_pnb = log_add([n_pnb, pb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb)
else:
n_prefix = prefix + (s, )
n_pb, n_pnb = next_hyps[n_prefix]
n_pnb = log_add([n_pnb, pb + ps, pnb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb)
# 2.2 Second beam prune
next_hyps = sorted(
next_hyps.items(),
key=lambda x: log_add(list(x[1])),
reverse=True)
cur_hyps = next_hyps[:beam_size]
hyps = [(y[0], log_add([y[1][0], y[1][1]])) for y in cur_hyps]
return hyps, encoder_out
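
The prefix beam search above merges path probabilities in log space with log_add (imported from deepspeech.utils.utility); a minimal sketch of the numerically stable log-sum-exp it is assumed to compute:

import math
from typing import List

def log_add_sketch(args: List[float]) -> float:
    """Stable log(sum(exp(a))) used to merge log-probabilities of alternative paths."""
    if all(a == -float('inf') for a in args):
        return -float('inf')
    a_max = max(args)
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))

print(log_add_sketch([math.log(0.25), math.log(0.25)]))  # == log(0.5)
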
def ctc_prefix_beam_search(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False, ) -> List[int]:
""" Apply CTC prefix beam search
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[int]: CTC prefix beam search nbest results
"""
hyps, _ = self._ctc_prefix_beam_search(
speech, speech_lengths, beam_size, decoding_chunk_size,
num_decoding_left_chunks, simulate_streaming)
return hyps[0][0]
def attention_rescoring(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
ctc_weight: float=0.0,
simulate_streaming: bool=False, ) -> List[int]:
""" Apply attention rescoring decoding, CTC prefix beam search
is applied first to get nbest, then we rescore the nbest on the
attention decoder with the corresponding encoder output
Args:
speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search
decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here
simulate_streaming (bool): whether do encoder forward in a
streaming fashion
Returns:
List[int]: Attention rescoring result
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
device = speech.place
batch_size = speech.shape[0]
# For attention rescoring we only support batch_size=1
assert batch_size == 1
# encoder_out: (1, maxlen, encoder_dim), len(hyps) = beam_size
hyps, encoder_out = self._ctc_prefix_beam_search(
speech, speech_lengths, beam_size, decoding_chunk_size,
num_decoding_left_chunks, simulate_streaming)
assert len(hyps) == beam_size
hyps_pad = pad_sequence([
paddle.to_tensor(hyp[0], place=device, dtype=paddle.long)
for hyp in hyps
], True, self.ignore_id) # (beam_size, max_hyps_len)
hyps_lens = paddle.to_tensor(
[len(hyp[0]) for hyp in hyps], place=device,
dtype=paddle.long) # (beam_size,)
hyps_pad, _ = add_sos_eos(hyps_pad, self.sos, self.eos, self.ignore_id)
hyps_lens = hyps_lens + 1  # Add <sos> at beginning
encoder_out = encoder_out.repeat(beam_size, 1, 1)
encoder_mask = paddle.ones(
(beam_size, 1, encoder_out.size(1)), dtype=paddle.bool)
decoder_out, _ = self.decoder(
encoder_out, encoder_mask, hyps_pad,
hyps_lens) # (beam_size, max_hyps_len, vocab_size)
decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1)
decoder_out = decoder_out.numpy()
# Only use decoder score for rescoring
best_score = -float('inf')
best_index = 0
for i, hyp in enumerate(hyps):
score = 0.0
for j, w in enumerate(hyp[0]):
score += decoder_out[i][j][w]
score += decoder_out[i][len(hyp[0])][self.eos]
# add ctc score
score += hyp[1] * ctc_weight
if score > best_score:
best_score = score
best_index = i
return hyps[best_index][0]
@jit.export
def subsampling_rate(self) -> int:
""" Export interface for c++ call, return subsampling_rate of the
model
"""
return self.encoder.embed.subsampling_rate
@jit.export
def right_context(self) -> int:
""" Export interface for c++ call, return right_context of the model
"""
return self.encoder.embed.right_context
@jit.export
def sos_symbol(self) -> int:
""" Export interface for c++ call, return sos symbol id of the model
"""
return self.sos
@jit.export
def eos_symbol(self) -> int:
""" Export interface for c++ call, return eos symbol id of the model
"""
return self.eos
@jit.export
def forward_encoder_chunk(
self,
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None,
elayers_output_cache: Optional[List[paddle.Tensor]]=None,
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
""" Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk.
Args:
xs (paddle.Tensor): chunk input
subsampling_cache (Optional[paddle.Tensor]): subsampling cache
elayers_output_cache (Optional[List[paddle.Tensor]]):
transformer/conformer encoder layers output cache
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
cnn cache
Returns:
paddle.Tensor: output, it ranges from time 0 to current chunk.
paddle.Tensor: subsampling cache
List[paddle.Tensor]: attention cache
List[paddle.Tensor]: conformer cnn cache
"""
return self.encoder.forward_chunk(
xs, offset, required_cache_size, subsampling_cache,
elayers_output_cache, conformer_cnn_cache)
@jit.export
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:
""" Export interface for c++ call, apply linear transform and log
softmax before ctc
Args:
xs (paddle.Tensor): encoder output
Returns:
paddle.Tensor: activation before ctc
"""
return self.ctc.log_softmax(xs)
@jit.export
def forward_attention_decoder(
self,
hyps: paddle.Tensor,
hyps_lens: paddle.Tensor,
encoder_out: paddle.Tensor, ) -> paddle.Tensor:
""" Export interface for c++ call, forward decoder with multiple
hypothesis from ctc prefix beam search and one encoder output
Args:
hyps (paddle.Tensor): hyps from ctc prefix beam search, already
padded with sos at the beginning, (B, T)
hyps_lens (paddle.Tensor): length of each hyp in hyps, (B)
encoder_out (paddle.Tensor): corresponding encoder output, (B=1, T, D)
Returns:
paddle.Tensor: decoder output, (B, max_hyps_len, vocab_size)
"""
assert encoder_out.size(0) == 1
num_hyps = hyps.size(0)
assert hyps_lens.size(0) == num_hyps
encoder_out = encoder_out.repeat(num_hyps, 1, 1)
# (B, 1, T)
encoder_mask = paddle.ones(
[num_hyps, 1, encoder_out.size(1)], dtype=paddle.bool)
# (num_hyps, max_hyps_len, vocab_size)
decoder_out, _ = self.decoder(encoder_out, encoder_mask, hyps,
hyps_lens)
decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1)
return decoder_out
@paddle.no_grad()
def decode(self,
feats: paddle.Tensor,
feats_lengths: paddle.Tensor,
text_feature: Dict[str, int],
decoding_method: str,
lang_model_path: str,
beam_alpha: float,
beam_beta: float,
beam_size: int,
cutoff_prob: float,
cutoff_top_n: int,
num_processes: int,
ctc_weight: float=0.0,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False):
"""u2 decoding.
Args:
feats (Tensor): audio features, (B, T, D)
feats_lengths (Tensor): (B)
text_feature (TextFeaturizer): text feature object.
decoding_method (str): decoding mode, e.g.
'attention', 'ctc_greedy_search',
'ctc_prefix_beam_search', 'attention_rescoring'
lang_model_path (str): lm path.
beam_alpha (float): lm weight.
beam_beta (float): length penalty.
beam_size (int): beam size for search
cutoff_prob (float): cutoff probability for pruning.
cutoff_top_n (int): cutoff number for pruning.
num_processes (int): number of parallel decoding processes.
ctc_weight (float, optional): ctc weight for attention rescoring decode mode. Defaults to 0.0.
decoding_chunk_size (int, optional): decoding chunk size. Defaults to -1.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
0: used for training, it's prohibited here.
num_decoding_left_chunks (int, optional):
number of left chunks for decoding. Defaults to -1.
simulate_streaming (bool, optional): simulate streaming inference. Defaults to False.
Raises:
ValueError: when not support decoding_method.
Returns:
List[List[int]]: transcripts.
"""
batch_size = feats.size(0)
if decoding_method in ['ctc_prefix_beam_search',
'attention_rescoring'] and batch_size > 1:
logger.fatal(
f'decoding mode {decoding_method} must be running with batch_size == 1'
)
sys.exit(1)
if decoding_method == 'attention':
hyps = self.recognize(
feats,
feats_lengths,
beam_size=beam_size,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
hyps = [hyp.tolist() for hyp in hyps]
elif decoding_method == 'ctc_greedy_search':
hyps = self.ctc_greedy_search(
feats,
feats_lengths,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
# ctc_prefix_beam_search and attention_rescoring only return one
# result in List[int]; change it to List[List[int]] for compatibility
# with the other batch decoding modes
elif decoding_method == 'ctc_prefix_beam_search':
assert feats.size(0) == 1
hyp = self.ctc_prefix_beam_search(
feats,
feats_lengths,
beam_size,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
hyps = [hyp]
elif decoding_method == 'attention_rescoring':
assert feats.size(0) == 1
hyp = self.attention_rescoring(
feats,
feats_lengths,
beam_size,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
ctc_weight=ctc_weight,
simulate_streaming=simulate_streaming)
hyps = [hyp]
else:
raise ValueError(f"Not support decoding method: {decoding_method}")
res = [text_feature.defeaturize(hyp) for hyp in hyps]
return res
class U2Model(U2BaseModel):
def __init__(self, configs: dict):
vocab_size, encoder, decoder, ctc = U2Model._init_from_config(configs)
super().__init__(
vocab_size=vocab_size,
encoder=encoder,
decoder=decoder,
ctc=ctc,
**configs['model_conf'])
@classmethod
def _init_from_config(cls, configs: dict):
"""init sub module for model.
Args:
configs (dict): config dict.
Raises:
ValueError: raised when the encoder type is not supported.
Returns:
int, nn.Layer, nn.Layer, nn.Layer: vocab size, encoder, decoder, ctc
"""
if configs['cmvn_file'] is not None:
mean, istd = load_cmvn(configs['cmvn_file'],
configs['cmvn_file_type'])
global_cmvn = GlobalCMVN(
paddle.to_tensor(mean, dtype=paddle.float),
paddle.to_tensor(istd, dtype=paddle.float))
else:
global_cmvn = None
input_dim = configs['input_dim']
vocab_size = configs['output_dim']
assert input_dim != 0, input_dim
assert vocab_size != 0, vocab_size
encoder_type = configs.get('encoder', 'transformer')
logger.info(f"U2 Encoder type: {encoder_type}")
if encoder_type == 'transformer':
encoder = TransformerEncoder(
input_dim, global_cmvn=global_cmvn, **configs['encoder_conf'])
elif encoder_type == 'conformer':
encoder = ConformerEncoder(
input_dim, global_cmvn=global_cmvn, **configs['encoder_conf'])
else:
raise ValueError(f"not support encoder type:{encoder_type}")
decoder = TransformerDecoder(vocab_size,
encoder.output_size(),
**configs['decoder_conf'])
ctc = CTCDecoder(
odim=vocab_size,
enc_n_units=encoder.output_size(),
blank_id=0,
dropout_rate=0.0,
reduction=True, # sum
batch_average=True) # sum / batch_size
return vocab_size, encoder, decoder, ctc
@classmethod
def from_config(cls, configs: dict):
"""init model.
Args:
configs (dict): config dict.
Raises:
ValueError: raised when the encoder type is not supported.
Returns:
nn.Layer: U2Model
"""
model = cls(configs)
return model
@classmethod
def from_pretrained(cls, dataset, config, checkpoint_path):
"""Build a DeepSpeech2Model model from a pretrained model.
Args:
dataset (paddle.io.Dataset): not used.
config (yacs.config.CfgNode): model configs
checkpoint_path (Path or str): the path of pretrained model checkpoint, without extension name
Returns:
DeepSpeech2Model: The model built from pretrained result.
"""
config.defrost()
config.input_dim = dataset.feature_size
config.output_dim = dataset.vocab_size
config.freeze()
model = cls.from_config(config)
if checkpoint_path:
infos = checkpoint.load_parameters(
model, checkpoint_path=checkpoint_path)
logger.info(f"checkpoint info: {infos}")
layer_tools.summary(model)
return model
class U2InferModel(U2Model):
def __init__(self, configs: dict):
super().__init__(configs)
def forward(self,
feats,
feats_lengths,
decoding_chunk_size=-1,
num_decoding_left_chunks=-1,
simulate_streaming=False):
"""export model function
Args:
feats (Tensor): [B, T, D]
feats_lengths (Tensor): [B]
Returns:
List[List[int]]: best path result
"""
return self.ctc_greedy_search(
feats,
feats_lengths,
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming)
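
For orientation, here is a hypothetical usage sketch of the U2 API above. The import path, the model_conf keys, and every hyper-parameter value are assumptions chosen for illustration; only the config-dict structure consumed by U2Model._init_from_config is taken from the code above.

# Hypothetical sketch (not part of this PR): build a small conformer U2 model
# from a plain config dict and run CTC greedy search on a random batch.
import paddle
from deepspeech.models.u2 import U2Model  # assumed import path

configs = {
    'cmvn_file': None,             # skip GlobalCMVN in this sketch
    'cmvn_file_type': 'json',
    'input_dim': 80,               # fbank feature dim
    'output_dim': 1000,            # vocab size
    'encoder': 'conformer',
    'encoder_conf': dict(output_size=256, attention_heads=4,
                         linear_units=1024, num_blocks=3, dropout_rate=0.1),
    'decoder_conf': dict(attention_heads=4, linear_units=1024,
                         num_blocks=3, dropout_rate=0.1),
    # keys below are assumed to match the u2 example yaml configs
    'model_conf': dict(ctc_weight=0.3, lsm_weight=0.1,
                       length_normalized_loss=False),
}

model = U2Model(configs)
model.eval()
feats = paddle.randn([2, 120, 80])                       # (B, T, D)
feats_lens = paddle.to_tensor([120, 90], dtype='int64')  # (B,)
with paddle.no_grad():
    hyps = model.ctc_greedy_search(
        feats,
        feats_lens,
        decoding_chunk_size=-1,
        num_decoding_left_chunks=-1,
        simulate_streaming=False)
print(hyps)  # List[List[int]] of token ids (random here, the model is untrained)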

@ -11,19 +11,16 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from collections import OrderedDict
import logging
import numpy as np
import math
import paddle import paddle
from paddle import nn from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
logger = logging.getLogger(__name__) from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['brelu', "softplus", "gelu_accurate", "gelu", 'Swish'] __all__ = ["get_activation", "brelu", "LinearGLUBlock", "ConvGLUBlock"]
def brelu(x, t_min=0.0, t_max=24.0, name=None): def brelu(x, t_min=0.0, t_max=24.0, name=None):
@ -33,36 +30,116 @@ def brelu(x, t_min=0.0, t_max=24.0, name=None):
return x.maximum(t_min).minimum(t_max) return x.maximum(t_min).minimum(t_max)
def softplus(x): class LinearGLUBlock(nn.Layer):
"""Softplus function.""" """A linear Gated Linear Units (GLU) block."""
if hasattr(paddle.nn.functional, 'softplus'):
#return paddle.nn.functional.softplus(x.float()).type_as(x) def __init__(self, idim: int):
return paddle.nn.functional.softplus(x) """ GLU.
else: Args:
raise NotImplementedError idim (int): input and output dimension
"""
super().__init__()
self.fc = nn.Linear(idim, idim * 2)
def forward(self, xs):
return glu(self.fc(xs), dim=-1)
class ConvGLUBlock(nn.Layer):
def __init__(self, kernel_size, in_ch, out_ch, bottlececk_dim=0,
dropout=0.):
"""A convolutional Gated Linear Units (GLU) block.
Args:
kernel_size (int): kernel size
in_ch (int): number of input channels
out_ch (int): number of output channels
bottlececk_dim (int): dimension of the bottleneck layers for computational efficiency. Defaults to 0.
dropout (float): dropout probability. Defaults to 0..
"""
super().__init__()
self.conv_residual = None
if in_ch != out_ch:
self.conv_residual = nn.utils.weight_norm(
nn.Conv2D(
in_channels=in_ch, out_channels=out_ch, kernel_size=(1, 1)),
name='weight',
dim=0)
self.dropout_residual = nn.Dropout(p=dropout)
self.pad_left = ConstantPad2d((0, 0, kernel_size - 1, 0), 0)
layers = OrderedDict()
if bottlececk_dim == 0:
layers['conv'] = nn.utils.weight_norm(
nn.Conv2D(
in_channels=in_ch,
out_channels=out_ch * 2,
kernel_size=(kernel_size, 1)),
name='weight',
dim=0)
# TODO(hirofumi0810): padding?
layers['dropout'] = nn.Dropout(p=dropout)
layers['glu'] = GLU()
def gelu_accurate(x): elif bottlececk_dim > 0:
"""Gaussian Error Linear Units (GELU) activation.""" layers['conv_in'] = nn.utils.weight_norm(
# [reference] https://github.com/pytorch/fairseq/blob/e75cff5f2c1d62f12dc911e0bf420025eb1a4e33/fairseq/modules/gelu.py nn.Conv2D(
if not hasattr(gelu_accurate, "_a"): in_channels=in_ch,
gelu_accurate._a = math.sqrt(2 / math.pi) out_channels=bottlececk_dim,
return 0.5 * x * (1 + paddle.tanh(gelu_accurate._a * kernel_size=(1, 1)),
(x + 0.044715 * paddle.pow(x, 3)))) name='weight',
dim=0)
layers['dropout_in'] = nn.Dropout(p=dropout)
layers['conv_bottleneck'] = nn.utils.weight_norm(
nn.Conv2D(
in_channels=bottlececk_dim,
out_channels=bottlececk_dim,
kernel_size=(kernel_size, 1)),
name='weight',
dim=0)
layers['dropout'] = nn.Dropout(p=dropout)
layers['glu'] = GLU()
layers['conv_out'] = nn.utils.weight_norm(
nn.Conv2D(
in_channels=bottlececk_dim,
out_channels=out_ch * 2,
kernel_size=(1, 1)),
name='weight',
dim=0)
layers['dropout_out'] = nn.Dropout(p=dropout)
self.layers = nn.Sequential(layers)
def gelu(x): def forward(self, xs):
"""Gaussian Error Linear Units (GELU) activation.""" """Forward pass.
if hasattr(torch.nn.functional, 'gelu'): Args:
#return torch.nn.functional.gelu(x.float()).type_as(x) xs (FloatTensor): `[B, in_ch, T, feat_dim]`
return torch.nn.functional.gelu(x) Returns:
else: out (FloatTensor): `[B, out_ch, T, feat_dim]`
return x * 0.5 * (1.0 + paddle.erf(x / math.sqrt(2.0))) """
residual = xs
if self.conv_residual is not None:
residual = self.dropout_residual(self.conv_residual(residual))
xs = self.pad_left(xs) # `[B, embed_dim, T+kernel-1, 1]`
xs = self.layers(xs) # `[B, out_ch * 2, T ,1]`
xs = xs + residual
return xs
class Swish(nn.Layer): def get_activation(act):
"""Construct an Swish object.""" """Return activation function."""
# Lazy load to avoid unused import
activation_funcs = {
"hardtanh": paddle.nn.Hardtanh,
"tanh": paddle.nn.Tanh,
"relu": paddle.nn.ReLU,
"selu": paddle.nn.SELU,
"swish": paddle.nn.Swish,
"gelu": paddle.nn.GELU,
"brelu": brelu,
}
def forward(self, x: paddle.Tensor) -> paddle.Tensor: return activation_funcs[act]()
"""Return Swish activation function."""
return x * F.sigmoid(x)
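
All of the GLU blocks introduced above reduce to the same gating rule: project to twice the width, split in half, and gate one half with the sigmoid of the other. A minimal sketch (illustration only) checking the manual computation against paddle.nn.functional.glu:

# GLU gating sketch: out = a * sigmoid(b), where [a, b] = split(x) on the
# gated axis. This is what LinearGLUBlock applies after its Linear layer.
import paddle
import paddle.nn.functional as F

x = paddle.randn([2, 6])                 # gated axis must have even size
a, b = paddle.split(x, 2, axis=-1)       # two (2, 3) halves
manual = a * F.sigmoid(b)
builtin = F.glu(x, axis=-1)
print(paddle.allclose(manual, builtin))  # expected: True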

@ -0,0 +1,233 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Multi-Head Attention layer definition."""
import math
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from paddle.nn import initializer as I
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["MultiHeadedAttention", "RelPositionMultiHeadedAttention"]
# Relative Positional Encodings
# https://www.jianshu.com/p/c0608efcc26f
# https://zhuanlan.zhihu.com/p/344604604
class MultiHeadedAttention(nn.Layer):
"""Multi-Head Attention layer."""
def __init__(self, n_head: int, n_feat: int, dropout_rate: float):
"""Construct an MultiHeadedAttention object.
Args:
n_head (int): The number of heads.
n_feat (int): The number of features.
dropout_rate (float): Dropout rate.
"""
super().__init__()
assert n_feat % n_head == 0
# We assume d_v always equals d_k
self.d_k = n_feat // n_head
self.h = n_head
self.linear_q = nn.Linear(n_feat, n_feat)
self.linear_k = nn.Linear(n_feat, n_feat)
self.linear_v = nn.Linear(n_feat, n_feat)
self.linear_out = nn.Linear(n_feat, n_feat)
self.dropout = nn.Dropout(p=dropout_rate)
def forward_qkv(self,
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Transform query, key and value.
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
Returns:
paddle.Tensor: Transformed query tensor, size
(#batch, n_head, time1, d_k).
paddle.Tensor: Transformed key tensor, size
(#batch, n_head, time2, d_k).
paddle.Tensor: Transformed value tensor, size
(#batch, n_head, time2, d_k).
"""
n_batch = query.size(0)
q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
q = q.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
k = k.transpose([0, 2, 1, 3]) # (batch, head, time2, d_k)
v = v.transpose([0, 2, 1, 3]) # (batch, head, time2, d_k)
return q, k, v
def forward_attention(self,
value: paddle.Tensor,
scores: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor:
"""Compute attention context vector.
Args:
value (paddle.Tensor): Transformed value, size
(#batch, n_head, time2, d_k).
scores (paddle.Tensor): Attention score, size
(#batch, n_head, time1, time2).
mask (paddle.Tensor): Mask, size (#batch, 1, time2) or
(#batch, time1, time2).
Returns:
paddle.Tensor: Transformed value weighted
by the attention score, (#batch, time1, d_model).
"""
n_batch = value.size(0)
if mask is not None:
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
scores = scores.masked_fill(mask, -float('inf'))
attn = paddle.softmax(
scores, axis=-1).masked_fill(mask,
0.0) # (batch, head, time1, time2)
else:
attn = paddle.softmax(
scores, axis=-1) # (batch, head, time1, time2)
p_attn = self.dropout(attn)
x = paddle.matmul(p_attn, value) # (batch, head, time1, d_k)
x = x.transpose([0, 2, 1, 3]).contiguous().view(
n_batch, -1, self.h * self.d_k) # (batch, time1, d_model)
return self.linear_out(x) # (batch, time1, d_model)
def forward(self,
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor:
"""Compute scaled dot product attention.
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
Returns:
paddle.Tensor: Output tensor (#batch, time1, d_model).
"""
q, k, v = self.forward_qkv(query, key, value)
scores = paddle.matmul(q,
k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k)
return self.forward_attention(v, scores, mask)
class RelPositionMultiHeadedAttention(MultiHeadedAttention):
"""Multi-Head Attention layer with relative position encoding."""
def __init__(self, n_head, n_feat, dropout_rate):
"""Construct an RelPositionMultiHeadedAttention object.
Paper: https://arxiv.org/abs/1901.02860
Args:
n_head (int): The number of heads.
n_feat (int): The number of features.
dropout_rate (float): Dropout rate.
"""
super().__init__(n_head, n_feat, dropout_rate)
# linear transformation for positional encoding
self.linear_pos = nn.Linear(n_feat, n_feat, bias_attr=False)
# these two learnable bias are used in matrix c and matrix d
# as described in https://arxiv.org/abs/1901.02860 Section 3.3
#self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
#self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
#torch.nn.init.xavier_uniform_(self.pos_bias_u)
#torch.nn.init.xavier_uniform_(self.pos_bias_v)
pos_bias_u = self.create_parameter(
[self.h, self.d_k], default_initializer=I.XavierUniform())
self.add_parameter('pos_bias_u', pos_bias_u)
pos_bias_v = self.create_parameter(
(self.h, self.d_k), default_initializer=I.XavierUniform())
self.add_parameter('pos_bias_v', pos_bias_v)
def rel_shift(self, x, zero_triu: bool=False):
"""Compute relative positinal encoding.
Args:
x (paddle.Tensor): Input tensor (batch, head, time1, time1).
zero_triu (bool): If true, return the lower triangular part of
the matrix.
Returns:
paddle.Tensor: Output tensor. (batch, head, time1, time1)
"""
zero_pad = paddle.zeros(
(x.size(0), x.size(1), x.size(2), 1), dtype=x.dtype)
x_padded = paddle.concat([zero_pad, x], axis=-1)
x_padded = x_padded.view(x.size(0), x.size(1), x.size(3) + 1, x.size(2))
x = x_padded[:, :, 1:].view_as(x) # [B, H, T1, T1]
if zero_triu:
ones = paddle.ones((x.size(2), x.size(3)))
x = x * paddle.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
return x
def forward(self,
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
pos_emb: paddle.Tensor,
mask: Optional[paddle.Tensor]):
"""Compute 'Scaled Dot Product Attention' with rel. positional encoding.
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
pos_emb (paddle.Tensor): Positional embedding tensor
(#batch, time1, size).
mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
Returns:
paddle.Tensor: Output tensor (#batch, time1, d_model).
"""
q, k, v = self.forward_qkv(query, key, value)
q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k)
n_batch_pos = pos_emb.size(0)
p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
# (batch, head, time1, d_k)
q_with_bias_u = (q + self.pos_bias_u).transpose([0, 2, 1, 3])
# (batch, head, time1, d_k)
q_with_bias_v = (q + self.pos_bias_v).transpose([0, 2, 1, 3])
# compute attention score
# first compute matrix a and matrix c
# as described in https://arxiv.org/abs/1901.02860 Section 3.3
# (batch, head, time1, time2)
matrix_ac = paddle.matmul(q_with_bias_u, k.transpose([0, 1, 3, 2]))
# compute matrix b and matrix d
# (batch, head, time1, time2)
matrix_bd = paddle.matmul(q_with_bias_v, p.transpose([0, 1, 3, 2]))
# Remove rel_shift since it is useless in speech recognition,
# and it requires special attention for streaming.
# matrix_bd = self.rel_shift(matrix_bd)
scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k) # (batch, head, time1, time2)
return self.forward_attention(v, scores, mask)
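
A quick shape-check sketch for the plain multi-head attention above. It assumes deepspeech.modules.attention is importable and that the torch-compatibility tensor hacks added elsewhere in this PR (the .size()/.view()/.contiguous() methods the layer relies on) are active; mask=None is passed so every key position is attended.

# Shape-check sketch for MultiHeadedAttention (illustration only).
import paddle
from deepspeech.modules.attention import MultiHeadedAttention

attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.0)
q = paddle.randn([2, 10, 256])   # (batch, time1, size)
kv = paddle.randn([2, 20, 256])  # (batch, time2, size)
out = attn(q, kv, kv, None)      # mask=None -> attend to all key positions
print(out.shape)                 # expected: [2, 10, 256]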

@ -0,0 +1,51 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['GlobalCMVN']
class GlobalCMVN(nn.Layer):
def __init__(self,
mean: paddle.Tensor,
istd: paddle.Tensor,
norm_var: bool=True):
"""
Args:
mean (paddle.Tensor): mean stats
istd (paddle.Tensor): inverse std, i.e. 1.0 / std
"""
super().__init__()
assert mean.shape == istd.shape
self.norm_var = norm_var
# The buffer can be accessed from this module using self.mean
self.register_buffer("mean", mean)
self.register_buffer("istd", istd)
def forward(self, x: paddle.Tensor):
"""
Args:
x (paddle.Tensor): (batch, max_len, feat_dim)
Returns:
(paddle.Tensor): normalized feature
"""
x = x - self.mean
if self.norm_var:
x = x * self.istd
return x
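
The mean/istd buffers come from precomputed CMVN statistics (cmvn_file); the sketch below derives them directly from a stack of training frames just to show what GlobalCMVN expects. The import path is assumed.

# CMVN sketch: build mean/istd from raw frames and normalize a batch.
import numpy as np
import paddle
from deepspeech.modules.cmvn import GlobalCMVN  # assumed import path

frames = np.random.randn(1000, 80).astype('float32')    # all training frames
mean = frames.mean(axis=0)
istd = 1.0 / (frames.std(axis=0) + 1e-20)                # inverse std

cmvn = GlobalCMVN(paddle.to_tensor(mean), paddle.to_tensor(istd))
x = paddle.to_tensor(frames[None, :10, :])               # (batch=1, time=10, 80)
y = cmvn(x)                                              # ~zero mean, unit var
print(y.shape)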

@ -0,0 +1,161 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ConvolutionModule definition."""
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from typeguard import check_argument_types
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['ConvolutionModule']
class ConvolutionModule(nn.Layer):
"""ConvolutionModule in Conformer model."""
def __init__(self,
channels: int,
kernel_size: int=15,
activation: nn.Layer=nn.ReLU(),
norm: str="batch_norm",
causal: bool=False,
bias: bool=True):
"""Construct an ConvolutionModule object.
Args:
channels (int): The number of channels of conv layers.
kernel_size (int): Kernel size of conv layers.
activation (nn.Layer): Activation Layer.
norm (str): Normalization type, 'batch_norm' or 'layer_norm'
causal (bool): Whether to use causal convolution or not.
bias (bool): Whether to use a bias in the conv layers or not.
"""
assert check_argument_types()
super().__init__()
self.pointwise_conv1 = nn.Conv1D(
channels,
2 * channels,
kernel_size=1,
stride=1,
padding=0,
bias_attr=None
if bias else False, # None for True, using bias as default config
)
# self.lorder is used to distinguish if it's a causal convolution,
# if self.lorder > 0:
# it's a causal convolution, the input will be padded with
# `self.lorder` frames on the left in forward (causal conv impl).
# else: it's a symmetrical convolution
if causal:
padding = 0
self.lorder = kernel_size - 1
else:
# kernel_size should be an odd number for a non-causal convolution
assert (kernel_size - 1) % 2 == 0
padding = (kernel_size - 1) // 2
self.lorder = 0
self.depthwise_conv = nn.Conv1D(
channels,
channels,
kernel_size,
stride=1,
padding=padding,
groups=channels,
bias_attr=None
if bias else False, # None for True, using bias as default config
)
assert norm in ['batch_norm', 'layer_norm']
if norm == "batch_norm":
self.use_layer_norm = False
self.norm = nn.BatchNorm1D(channels)
else:
self.use_layer_norm = True
self.norm = nn.LayerNorm(channels)
self.pointwise_conv2 = nn.Conv1D(
channels,
channels,
kernel_size=1,
stride=1,
padding=0,
bias_attr=None
if bias else False, # None for True, using bias as default config
)
self.activation = activation
def forward(self,
x: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
cache: Optional[paddle.Tensor]=None
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute convolution module.
Args:
x (paddle.Tensor): Input tensor (#batch, time, channels).
mask_pad (paddle.Tensor): used for batch padding, (#batch, channels, time).
cache (paddle.Tensor): left context cache, it is only
used in causal convolution. (#batch, channels, time')
Returns:
paddle.Tensor: Output tensor (#batch, time, channels).
paddle.Tensor: Output cache tensor (#batch, channels, time')
"""
# exchange the temporal dimension and the feature dimension
x = x.transpose([0, 2, 1]) # [B, C, T]
# mask batch padding
if mask_pad is not None:
x = x.masked_fill(mask_pad, 0.0)
if self.lorder > 0:
if cache is None:
x = nn.functional.pad(
x, (self.lorder, 0), 'constant', 0.0, data_format='NCL')
else:
assert cache.shape[0] == x.shape[0] # B
assert cache.shape[1] == x.shape[1] # C
x = paddle.concat((cache, x), axis=2)
assert (x.shape[2] > self.lorder)
new_cache = x[:, :, -self.lorder:] #[B, C, T]
else:
# It's better to just return None if no cache is required,
# However, for JIT export, here we just fake one tensor instead of
# None.
new_cache = paddle.zeros([1], dtype=x.dtype)
# GLU mechanism
x = self.pointwise_conv1(x) # (batch, 2*channel, dim)
x = nn.functional.glu(x, axis=1) # (batch, channel, dim)
# 1D Depthwise Conv
x = self.depthwise_conv(x)
if self.use_layer_norm:
x = x.transpose([0, 2, 1]) # [B, T, C]
x = self.activation(self.norm(x))
if self.use_layer_norm:
x = x.transpose([0, 2, 1]) # [B, C, T]
x = self.pointwise_conv2(x)
# mask batch padding
if mask_pad is not None:
x = x.masked_fill(mask_pad, 0.0)
x = x.transpose([0, 2, 1]) # [B, T, C]
return x, new_cache
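
To illustrate the cache bookkeeping above: with causal=True the module keeps lorder = kernel_size - 1 frames of left context, so feeding an utterance chunk by chunk with the returned cache reproduces the full-utterance output. The sketch assumes the module path and runs in eval mode so BatchNorm uses its (initial) running statistics.

# Streaming-equivalence sketch for the causal ConvolutionModule.
import paddle
from deepspeech.modules.conformer_convolution import ConvolutionModule  # assumed path

conv = ConvolutionModule(channels=8, kernel_size=5, causal=True)
conv.eval()
x = paddle.randn([1, 12, 8])                  # (B, T, C)
full, _ = conv(x)                             # whole utterance at once

y1, cache = conv(x[:, :6, :])                 # first chunk, implicit zero cache
y2, _ = conv(x[:, 6:, :], cache=cache)        # second chunk reuses left context
chunked = paddle.concat([y1, y2], axis=1)
print(paddle.allclose(full, chunked, atol=1e-5))  # expected: True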

@ -11,20 +11,41 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import logging
import paddle
from paddle import nn from paddle import nn
from paddle.nn import functional as F from paddle.nn import functional as F
from paddle.nn import initializer as I
from deepspeech.modules.mask import sequence_mask
from deepspeech.modules.activation import brelu from deepspeech.modules.activation import brelu
from deepspeech.modules.mask import sequence_mask
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['ConvStack', "conv_output_size"]
def conv_output_size(I, F, P, S):
# https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks#hyperparameters
# Output size after Conv:
# By noting I the length of the input volume size,
# F the length of the filter,
# P the amount of zero padding,
# S the stride,
# then the output size O of the feature map along that dimension is given by:
# O = (I - F + Pstart + Pend) // S + 1
# When Pstart == Pend == P, we can replace Pstart + Pend by 2P.
# When Pstart == Pend == 0
# O = (I - F) // S + 1
# https://iq.opengenus.org/output-size-of-convolution/
# Output height = (Input height + padding height top + padding height bottom - kernel height) / (stride height) + 1
# Output width = (Input width + padding width right + padding width left - kernel width) / (stride width) + 1
return (I - F + 2 * P) // S + 1
logger = logging.getLogger(__name__)
__all__ = ['ConvStack'] # receptive field calculator
# https://fomoro.com/research/article/receptive-field-calculator
# https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks#hyperparameters
# https://distill.pub/2019/computing-receptive-fields/
# Rl-1 = Sl * Rl + (Kl - Sl)
class ConvBn(nn.Layer): class ConvBn(nn.Layer):
@ -120,7 +141,7 @@ class ConvStack(nn.Layer):
act='brelu') act='brelu')
out_channel = 32 out_channel = 32
self.conv_stack = nn.LayerList([ convs = [
ConvBn( ConvBn(
num_channels_in=32, num_channels_in=32,
num_channels_out=out_channel, num_channels_out=out_channel,
@ -128,7 +149,8 @@ class ConvStack(nn.Layer):
stride=(2, 1), stride=(2, 1),
padding=(10, 5), padding=(10, 5),
act='brelu') for i in range(num_stacks - 1) act='brelu') for i in range(num_stacks - 1)
]) ]
self.conv_stack = nn.LayerList(convs)
# conv output feat_dim # conv output feat_dim
output_height = (feat_size - 1) // 2 + 1 output_height = (feat_size - 1) // 2 + 1
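
A worked example of the output-size rule quoted in the comments above, O = (I - F + P_start + P_end) // S + 1. Stride 2 and padding 10 are the values shown for the stacked ConvBn layers; the kernel height of 21 is an assumption chosen to match that padding.

# Conv output-size arithmetic (illustration only).
def conv_out_len(i, f, p, s):
    return (i - f + 2 * p) // s + 1

feat_size = 161                            # e.g. linear spectrogram bins
print(conv_out_len(feat_size, 21, 10, 2))  # 81
print((feat_size - 1) // 2 + 1)            # 81, the shortcut used in ConvStack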

@ -11,38 +11,36 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import logging
from typeguard import check_argument_types
import paddle import paddle
from paddle import nn from paddle import nn
from paddle.nn import functional as F from paddle.nn import functional as F
from paddle.nn import initializer as I from typeguard import check_argument_types
from deepspeech.decoders.swig_wrapper import Scorer
from deepspeech.decoders.swig_wrapper import ctc_greedy_decoder
from deepspeech.decoders.swig_wrapper import ctc_beam_search_decoder_batch from deepspeech.decoders.swig_wrapper import ctc_beam_search_decoder_batch
from deepspeech.decoders.swig_wrapper import ctc_greedy_decoder
from deepspeech.decoders.swig_wrapper import Scorer
from deepspeech.modules.loss import CTCLoss from deepspeech.modules.loss import CTCLoss
from deepspeech.utils import ctc_utils
from deepspeech.utils.log import Log
logger = logging.getLogger(__name__) logger = Log(__name__).getlog()
__all__ = ['CTCDecoder'] __all__ = ['CTCDecoder']
class CTCDecoder(nn.Layer): class CTCDecoder(nn.Layer):
def __init__(self, def __init__(self,
enc_n_units,
odim, odim,
enc_n_units,
blank_id=0, blank_id=0,
dropout_rate: float=0.0, dropout_rate: float=0.0,
reduction: bool=True, reduction: bool=True,
batch_average: bool=False): batch_average: bool=True):
"""CTC decoder """CTC decoder
Args: Args:
odim ([int]): text vocabulary size
enc_n_units ([int]): encoder output dimention enc_n_units ([int]): encoder output dimention
vocab_size ([int]): text vocabulary size
dropout_rate (float): dropout rate (0.0 ~ 1.0) dropout_rate (float): dropout rate (0.0 ~ 1.0)
reduction (bool): reduce the CTC loss into a scalar, True for 'sum' or 'none' reduction (bool): reduce the CTC loss into a scalar, True for 'sum' or 'none'
batch_average (bool): do batch dim wise average. batch_average (bool): do batch dim wise average.
@ -72,38 +70,31 @@ class CTCDecoder(nn.Layer):
ys_pad (Tenosr): batch of padded character id sequence tensor (B, Lmax) ys_pad (Tenosr): batch of padded character id sequence tensor (B, Lmax)
ys_lens (Tensor): batch of lengths of character sequence (B) ys_lens (Tensor): batch of lengths of character sequence (B)
Returns: Returns:
loss (Tenosr): scalar. loss (Tenosr): ctc loss value, scalar.
""" """
logits = self.ctc_lo(F.dropout(hs_pad, p=self.dropout_rate)) logits = self.ctc_lo(F.dropout(hs_pad, p=self.dropout_rate))
loss = self.criterion(logits, ys_pad, hlens, ys_lens) loss = self.criterion(logits, ys_pad, hlens, ys_lens)
return loss return loss
def probs(self, eouts: paddle.Tensor, temperature: float=1.0): def softmax(self, eouts: paddle.Tensor, temperature: float=1.0):
"""Get CTC probabilities. """Get CTC probabilities.
Args: Args:
eouts (FloatTensor): `[B, T, enc_units]` eouts (FloatTensor): `[B, T, enc_units]`
Returns: Returns:
probs (FloatTensor): `[B, T, odim]` probs (FloatTensor): `[B, T, odim]`
""" """
return F.softmax(self.ctc_lo(eouts) / temperature, axis=-1) self.probs = F.softmax(self.ctc_lo(eouts) / temperature, axis=2)
return self.probs
def scores(self, eouts: paddle.Tensor, temperature: float=1.0): def log_softmax(self, hs_pad: paddle.Tensor,
"""Get log-scale CTC probabilities. temperature: float=1.0) -> paddle.Tensor:
Args:
eouts (FloatTensor): `[B, T, enc_units]`
Returns:
log_probs (FloatTensor): `[B, T, odim]`
"""
return F.log_softmax(self.ctc_lo(eouts) / temperature, axis=-1)
def log_softmax(self, hs_pad: paddle.Tensor) -> paddle.Tensor:
"""log_softmax of frame activations """log_softmax of frame activations
Args: Args:
Tensor hs_pad: 3d tensor (B, Tmax, eprojs) Tensor hs_pad: 3d tensor (B, Tmax, eprojs)
Returns: Returns:
paddle.Tensor: log softmax applied 3d tensor (B, Tmax, odim) paddle.Tensor: log softmax applied 3d tensor (B, Tmax, odim)
""" """
return self.scores(hs_pad) return F.log_softmax(self.ctc_lo(hs_pad) / temperature, axis=2)
def argmax(self, hs_pad: paddle.Tensor) -> paddle.Tensor: def argmax(self, hs_pad: paddle.Tensor) -> paddle.Tensor:
"""argmax of frame activations """argmax of frame activations
@ -114,6 +105,20 @@ class CTCDecoder(nn.Layer):
""" """
return paddle.argmax(self.ctc_lo(hs_pad), dim=2) return paddle.argmax(self.ctc_lo(hs_pad), dim=2)
def forced_align(self,
ctc_probs: paddle.Tensor,
y: paddle.Tensor,
blank_id=0) -> list:
"""ctc forced alignment.
Args:
ctc_probs (paddle.Tensor): hidden state sequence, 2d tensor (T, D)
y (paddle.Tensor): label id sequence tensor, 1d tensor (L)
blank_id (int): blank symbol index
Returns:
paddle.Tensor: best alignment result, (T).
"""
return ctc_utils.forced_align(ctc_probs, y, blank_id)
def _decode_batch_greedy(self, probs_split, vocab_list): def _decode_batch_greedy(self, probs_split, vocab_list):
"""Decode by best path for a batch of probs matrix input. """Decode by best path for a batch of probs matrix input.
:param probs_split: List of 2-D probability matrix, and each consists :param probs_split: List of 2-D probability matrix, and each consists
@ -147,7 +152,7 @@ class CTCDecoder(nn.Layer):
:type vocab_list: list :type vocab_list: list
""" """
# init once # init once
if self._ext_scorer != None: if self._ext_scorer is not None:
return return
if language_model_path != '': if language_model_path != '':
@ -195,7 +200,7 @@ class CTCDecoder(nn.Layer):
:return: List of transcription texts. :return: List of transcription texts.
:rtype: List of str :rtype: List of str
""" """
if self._ext_scorer != None: if self._ext_scorer is not None:
self._ext_scorer.reset_params(beam_alpha, beam_beta) self._ext_scorer.reset_params(beam_alpha, beam_beta)
# beam search decode # beam search decode
@ -221,9 +226,28 @@ class CTCDecoder(nn.Layer):
def decode_probs(self, probs, logits_lens, vocab_list, decoding_method, def decode_probs(self, probs, logits_lens, vocab_list, decoding_method,
lang_model_path, beam_alpha, beam_beta, beam_size, lang_model_path, beam_alpha, beam_beta, beam_size,
cutoff_prob, cutoff_top_n, num_processes): cutoff_prob, cutoff_top_n, num_processes):
""" probs: activation after softmax """ctc decoding with probs.
logits_len: audio output lens
Args:
probs (Tenosr): activation after softmax
logits_lens (Tenosr): audio output lens
vocab_list ([type]): [description]
decoding_method ([type]): [description]
lang_model_path ([type]): [description]
beam_alpha ([type]): [description]
beam_beta ([type]): [description]
beam_size ([type]): [description]
cutoff_prob ([type]): [description]
cutoff_top_n ([type]): [description]
num_processes ([type]): [description]
Raises:
ValueError: when decoding_method not support.
Returns:
List[str]: transcripts.
""" """
probs_split = [probs[i, :l, :] for i, l in enumerate(logits_lens)] probs_split = [probs[i, :l, :] for i, l in enumerate(logits_lens)]
if decoding_method == "ctc_greedy": if decoding_method == "ctc_greedy":
result_transcripts = self._decode_batch_greedy( result_transcripts = self._decode_batch_greedy(
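
For reference, best-path (greedy) CTC decoding as used by _decode_batch_greedy / ctc_greedy_decoder boils down to: take the argmax label per frame, collapse consecutive repeats, then drop the blank (blank_id=0 in this PR). A standalone sketch:

# Greedy CTC decoding sketch (numpy only, illustration of the algorithm).
import numpy as np

def greedy_ctc(probs, blank_id=0):
    """probs: (T, odim) posteriors for one utterance -> token id list."""
    best_path = np.argmax(probs, axis=1)
    out, prev = [], None
    for tok in best_path:
        if tok != prev and tok != blank_id:
            out.append(int(tok))
        prev = tok
    return out

probs = np.array([[0.6, 0.3, 0.1],   # blank
                  [0.1, 0.8, 0.1],   # token 1
                  [0.2, 0.7, 0.1],   # token 1 again -> collapsed
                  [0.7, 0.2, 0.1],   # blank
                  [0.1, 0.1, 0.8]])  # token 2
print(greedy_ctc(probs))             # [1, 2]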

@ -0,0 +1,182 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Decoder definition."""
from typing import List
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from typeguard import check_argument_types
from deepspeech.modules.attention import MultiHeadedAttention
from deepspeech.modules.decoder_layer import DecoderLayer
from deepspeech.modules.embedding import PositionalEncoding
from deepspeech.modules.mask import make_non_pad_mask
from deepspeech.modules.mask import subsequent_mask
from deepspeech.modules.positionwise_feed_forward import PositionwiseFeedForward
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["TransformerDecoder"]
class TransformerDecoder(nn.Layer):
"""Base class of Transformer decoder module.
Args:
vocab_size: output dim
encoder_output_size: dimension of attention
attention_heads: the number of heads of multi head attention
linear_units: the hidden units number of position-wise feedforward
num_blocks: the number of decoder blocks
dropout_rate: dropout rate
self_attention_dropout_rate: dropout rate for attention
input_layer: input layer type, `embed`
use_output_layer: whether to use output layer
pos_enc_class: PositionalEncoding module
normalize_before:
True: use layer_norm before each sub-block of a layer.
False: use layer_norm after each sub-block of a layer.
concat_after: whether to concat attention layer's input and output
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
def __init__(
self,
vocab_size: int,
encoder_output_size: int,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
self_attention_dropout_rate: float=0.0,
src_attention_dropout_rate: float=0.0,
input_layer: str="embed",
use_output_layer: bool=True,
normalize_before: bool=True,
concat_after: bool=False, ):
assert check_argument_types()
super().__init__()
attention_dim = encoder_output_size
if input_layer == "embed":
self.embed = nn.Sequential(
nn.Embedding(vocab_size, attention_dim),
PositionalEncoding(attention_dim, positional_dropout_rate), )
else:
raise ValueError(f"only 'embed' is supported: {input_layer}")
self.normalize_before = normalize_before
self.after_norm = nn.LayerNorm(attention_dim, epsilon=1e-12)
self.use_output_layer = use_output_layer
self.output_layer = nn.Linear(attention_dim, vocab_size)
self.decoders = nn.LayerList([
DecoderLayer(
size=attention_dim,
self_attn=MultiHeadedAttention(attention_heads, attention_dim,
self_attention_dropout_rate),
src_attn=MultiHeadedAttention(attention_heads, attention_dim,
src_attention_dropout_rate),
feed_forward=PositionwiseFeedForward(
attention_dim, linear_units, dropout_rate),
dropout_rate=dropout_rate,
normalize_before=normalize_before,
concat_after=concat_after, ) for _ in range(num_blocks)
])
def forward(
self,
memory: paddle.Tensor,
memory_mask: paddle.Tensor,
ys_in_pad: paddle.Tensor,
ys_in_lens: paddle.Tensor, ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Forward decoder.
Args:
memory: encoded memory, float32 (batch, maxlen_in, feat)
memory_mask: encoder memory mask, (batch, 1, maxlen_in)
ys_in_pad: padded input token ids, int64 (batch, maxlen_out)
ys_in_lens: input lengths of this batch (batch)
Returns:
(tuple): tuple containing:
x: decoded token score before softmax (batch, maxlen_out, vocab_size)
if use_output_layer is True,
olens: (batch, )
"""
tgt = ys_in_pad
# tgt_mask: (B, 1, L)
tgt_mask = (make_non_pad_mask(ys_in_lens).unsqueeze(1))
# m: (1, L, L)
m = subsequent_mask(tgt_mask.size(-1)).unsqueeze(0)
# tgt_mask: (B, L, L)
# TODO(Hui Zhang): not support & for tensor
# tgt_mask = tgt_mask & m
tgt_mask = tgt_mask.logical_and(m)
x, _ = self.embed(tgt)
for layer in self.decoders:
x, tgt_mask, memory, memory_mask = layer(x, tgt_mask, memory,
memory_mask)
if self.normalize_before:
x = self.after_norm(x)
if self.use_output_layer:
x = self.output_layer(x)
# TODO(Hui Zhang): reduce_sum not support bool type
# olens = tgt_mask.sum(1)
olens = tgt_mask.astype(paddle.int).sum(1)
return x, olens
def forward_one_step(
self,
memory: paddle.Tensor,
memory_mask: paddle.Tensor,
tgt: paddle.Tensor,
tgt_mask: paddle.Tensor,
cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, List[paddle.Tensor]]:
"""Forward one step.
This is only used for decoding.
Args:
memory: encoded memory, float32 (batch, maxlen_in, feat)
memory_mask: encoded memory mask, (batch, 1, maxlen_in)
tgt: input token ids, int64 (batch, maxlen_out)
tgt_mask: input token mask, (batch, maxlen_out, maxlen_out)
dtype=paddle.bool
cache: cached output list of (batch, max_time_out-1, size)
Returns:
y, cache: NN output value and cache per `self.decoders`.
`y.shape` is (batch, vocab_size)
"""
x, _ = self.embed(tgt)
new_cache = []
for i, decoder in enumerate(self.decoders):
if cache is None:
c = None
else:
c = cache[i]
x, tgt_mask, memory, memory_mask = decoder(
x, tgt_mask, memory, memory_mask, cache=c)
new_cache.append(x)
if self.normalize_before:
y = self.after_norm(x[:, -1])
else:
y = x[:, -1]
if self.use_output_layer:
y = paddle.nn.functional.log_softmax(self.output_layer(y), axis=-1)
return y, new_cache
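
The target mask built in TransformerDecoder.forward is the logical AND of a padding mask derived from the hypothesis lengths and a lower-triangular "subsequent" mask, so position t can only attend to non-padded positions <= t. A plain numpy sketch of that combination:

# Decoder target-mask sketch (numpy stand-ins for make_non_pad_mask /
# subsequent_mask; illustration only).
import numpy as np

def non_pad_mask(lengths, max_len):
    return np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]  # (B, L)

def subsequent(size):
    return np.tril(np.ones((size, size), dtype=bool))                  # (L, L)

lens, L = [3, 2], 3
tgt_mask = non_pad_mask(lens, L)[:, None, :] & subsequent(L)[None, :, :]
print(tgt_mask.astype(int))
# utterance 0 attends to positions <= t; utterance 1 also masks its padded
# third position.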

@ -0,0 +1,151 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Decoder self-attention layer definition."""
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["DecoderLayer"]
class DecoderLayer(nn.Layer):
"""Single decoder layer module.
Args:
size (int): Input dimension.
self_attn (nn.Layer): Self-attention module instance.
`MultiHeadedAttention` instance can be used as the argument.
src_attn (nn.Layer): Source (encoder-decoder) attention module instance.
`MultiHeadedAttention` instance can be used as the argument.
feed_forward (nn.Layer): Feed-forward module instance.
`PositionwiseFeedForward` instance can be used as the argument.
dropout_rate (float): Dropout rate.
normalize_before (bool):
True: use layer_norm before each sub-block.
False: to use layer_norm after each sub-block.
concat_after (bool): Whether to concat attention layer's input
and output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
def __init__(
self,
size: int,
self_attn: nn.Layer,
src_attn: nn.Layer,
feed_forward: nn.Layer,
dropout_rate: float,
normalize_before: bool=True,
concat_after: bool=False, ):
"""Construct a DecoderLayer object."""
super().__init__()
self.size = size
self.self_attn = self_attn
self.src_attn = src_attn
self.feed_forward = feed_forward
self.norm1 = nn.LayerNorm(size, epsilon=1e-12)
self.norm2 = nn.LayerNorm(size, epsilon=1e-12)
self.norm3 = nn.LayerNorm(size, epsilon=1e-12)
self.dropout = nn.Dropout(dropout_rate)
self.normalize_before = normalize_before
self.concat_after = concat_after
self.concat_linear1 = nn.Linear(size + size, size)
self.concat_linear2 = nn.Linear(size + size, size)
def forward(
self,
tgt: paddle.Tensor,
tgt_mask: paddle.Tensor,
memory: paddle.Tensor,
memory_mask: paddle.Tensor,
cache: Optional[paddle.Tensor]=None
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute decoded features.
Args:
tgt (paddle.Tensor): Input tensor (#batch, maxlen_out, size).
tgt_mask (paddle.Tensor): Mask for input tensor
(#batch, maxlen_out).
memory (paddle.Tensor): Encoded memory
(#batch, maxlen_in, size).
memory_mask (paddle.Tensor): Encoded memory mask
(#batch, maxlen_in).
cache (paddle.Tensor): cached tensors.
(#batch, maxlen_out - 1, size).
Returns:
paddle.Tensor: Output tensor (#batch, maxlen_out, size).
paddle.Tensor: Mask for output tensor (#batch, maxlen_out).
paddle.Tensor: Encoded memory (#batch, maxlen_in, size).
paddle.Tensor: Encoded memory mask (#batch, maxlen_in).
"""
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
if cache is None:
tgt_q = tgt
tgt_q_mask = tgt_mask
else:
# compute only the last frame query keeping dim: max_time_out -> 1
assert cache.shape == [
tgt.shape[0],
tgt.shape[1] - 1,
self.size,
], f"{cache.shape} == {[tgt.shape[0], tgt.shape[1] - 1, self.size]}"
tgt_q = tgt[:, -1:, :]
residual = residual[:, -1:, :]
# TODO(Hui Zhang): slice not support bool type
# tgt_q_mask = tgt_mask[:, -1:, :]
tgt_q_mask = tgt_mask.cast(paddle.int64)[:, -1:, :].cast(
paddle.bool)
if self.concat_after:
tgt_concat = paddle.concat(
(tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), axis=-1)
x = residual + self.concat_linear1(tgt_concat)
else:
x = residual + self.dropout(
self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
if not self.normalize_before:
x = self.norm1(x)
residual = x
if self.normalize_before:
x = self.norm2(x)
if self.concat_after:
x_concat = paddle.concat(
(x, self.src_attn(x, memory, memory, memory_mask)), axis=-1)
x = residual + self.concat_linear2(x_concat)
else:
x = residual + self.dropout(
self.src_attn(x, memory, memory, memory_mask))
if not self.normalize_before:
x = self.norm2(x)
residual = x
if self.normalize_before:
x = self.norm3(x)
x = residual + self.dropout(self.feed_forward(x))
if not self.normalize_before:
x = self.norm3(x)
if cache is not None:
x = paddle.concat([cache, x], axis=1)
return x, tgt_mask, memory, memory_mask
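
The normalize_before flag above switches between pre-norm (x + sublayer(norm(x)), with a final after_norm in the decoder) and post-norm (norm(x + sublayer(x))) residual blocks. A minimal sketch with a Linear layer standing in for the attention/feed-forward sublayers:

# Pre-norm vs post-norm residual block sketch (illustration only).
import paddle
from paddle import nn

norm = nn.LayerNorm(4, epsilon=1e-12)
sublayer = nn.Linear(4, 4)           # stand-in for self_attn / feed_forward
x = paddle.randn([2, 5, 4])

pre_norm_out = x + sublayer(norm(x))    # normalize_before=True
post_norm_out = norm(x + sublayer(x))   # normalize_before=False
print(pre_norm_out.shape, post_norm_out.shape)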

@ -12,23 +12,17 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Positonal Encoding Module.""" """Positonal Encoding Module."""
import math import math
import logging
import numpy as np
from typing import Tuple from typing import Tuple
import paddle import paddle
from paddle import nn from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
logger = logging.getLogger(__name__) from deepspeech.utils.log import Log
__all__ = ["PositionalEncoding", "RelPositionalEncoding"] logger = Log(__name__).getlog()
# TODO(Hui Zhang): remove this hack __all__ = ["PositionalEncoding", "RelPositionalEncoding"]
paddle.float32 = 'float32'
class PositionalEncoding(nn.Layer): class PositionalEncoding(nn.Layer):
@ -51,10 +45,10 @@ class PositionalEncoding(nn.Layer):
self.max_len = max_len self.max_len = max_len
self.xscale = paddle.to_tensor(math.sqrt(self.d_model)) self.xscale = paddle.to_tensor(math.sqrt(self.d_model))
self.dropout = nn.Dropout(p=dropout_rate) self.dropout = nn.Dropout(p=dropout_rate)
self.pe = paddle.zeros(self.max_len, self.d_model) #[T,D] self.pe = paddle.zeros([self.max_len, self.d_model]) #[T,D]
position = paddle.arange( position = paddle.arange(
0, self.max_len, dtype=paddle.float32).unsqueeze(1) 0, self.max_len, dtype=paddle.float32).unsqueeze(1) #[T, 1]
div_term = paddle.exp( div_term = paddle.exp(
paddle.arange(0, self.d_model, 2, dtype=paddle.float32) * paddle.arange(0, self.d_model, 2, dtype=paddle.float32) *
-(math.log(10000.0) / self.d_model)) -(math.log(10000.0) / self.d_model))
@ -71,13 +65,11 @@ class PositionalEncoding(nn.Layer):
offset (int): position offset offset (int): position offset
Returns: Returns:
paddle.Tensor: Encoded tensor. Its shape is (batch, time, ...) paddle.Tensor: Encoded tensor. Its shape is (batch, time, ...)
paddle.Tensor: for compatibility to RelPositionalEncoding paddle.Tensor: for compatibility to RelPositionalEncoding, (batch=1, time, ...)
""" """
T = paddle.shape(x)[1] T = x.shape[1]
assert offset + T < self.max_len assert offset + x.size(1) < self.max_len
#assert offset + x.size(1) < self.max_len #TODO(Hui Zhang): using T = x.size(1), __getitem__ not support Tensor
#self.pe = self.pe.to(x.device)
#pos_emb = self.pe[:, offset:offset + x.size(1)]
pos_emb = self.pe[:, offset:offset + T] pos_emb = self.pe[:, offset:offset + T]
x = x * self.xscale + pos_emb x = x * self.xscale + pos_emb
return self.dropout(x), self.dropout(pos_emb) return self.dropout(x), self.dropout(pos_emb)
@ -122,11 +114,8 @@ class RelPositionalEncoding(PositionalEncoding):
paddle.Tensor: Encoded tensor (batch, time, `*`). paddle.Tensor: Encoded tensor (batch, time, `*`).
paddle.Tensor: Positional embedding tensor (1, time, `*`). paddle.Tensor: Positional embedding tensor (1, time, `*`).
""" """
T = paddle.shape()[1] assert offset + x.size(1) < self.max_len
assert offset + T < self.max_len
#assert offset + x.size(1) < self.max_len
#self.pe = self.pe.to(x.device)
x = x * self.xscale x = x * self.xscale
#pos_emb = self.pe[:, offset:offset + x.size(1)] #TODO(Hui Zhang): using x.size(1), __getitem__ not support Tensor
pos_emb = self.pe[:, offset:offset + T] pos_emb = self.pe[:, offset:offset + x.shape[1]]
return self.dropout(x), self.dropout(pos_emb) return self.dropout(x), self.dropout(pos_emb)
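
The table built in PositionalEncoding above follows the standard sinusoidal scheme, pe[t, 2i] = sin(t / 10000^(2i/d)) and pe[t, 2i+1] = cos(t / 10000^(2i/d)); RelPositionalEncoding reuses the same table but returns it separately instead of adding it to x. A numpy sketch of the table construction:

# Sinusoidal positional-encoding table sketch (numpy, illustration only).
import math
import numpy as np

d_model, max_len = 8, 5
pos = np.arange(max_len, dtype=np.float32)[:, None]              # (T, 1)
div = np.exp(np.arange(0, d_model, 2, dtype=np.float32) *
             -(math.log(10000.0) / d_model))                     # (d/2,)
pe = np.zeros((max_len, d_model), dtype=np.float32)
pe[:, 0::2] = np.sin(pos * div)
pe[:, 1::2] = np.cos(pos * div)
print(pe.shape)   # (5, 8); the module stores this unsqueezed to (1, T, D)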

@ -0,0 +1,448 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Encoder definition."""
from typing import List
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from typeguard import check_argument_types
from deepspeech.modules.activation import get_activation
from deepspeech.modules.attention import MultiHeadedAttention
from deepspeech.modules.attention import RelPositionMultiHeadedAttention
from deepspeech.modules.conformer_convolution import ConvolutionModule
from deepspeech.modules.embedding import PositionalEncoding
from deepspeech.modules.embedding import RelPositionalEncoding
from deepspeech.modules.encoder_layer import ConformerEncoderLayer
from deepspeech.modules.encoder_layer import TransformerEncoderLayer
from deepspeech.modules.mask import add_optional_chunk_mask
from deepspeech.modules.mask import make_non_pad_mask
from deepspeech.modules.positionwise_feed_forward import PositionwiseFeedForward
from deepspeech.modules.subsampling import Conv2dSubsampling4
from deepspeech.modules.subsampling import Conv2dSubsampling6
from deepspeech.modules.subsampling import Conv2dSubsampling8
from deepspeech.modules.subsampling import LinearNoSubsampling
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["BaseEncoder", 'TransformerEncoder', "ConformerEncoder"]
class BaseEncoder(nn.Layer):
def __init__(
self,
input_size: int,
output_size: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
pos_enc_layer_type: str="abs_pos",
normalize_before: bool=True,
concat_after: bool=False,
static_chunk_size: int=0,
use_dynamic_chunk: bool=False,
global_cmvn: paddle.nn.Layer=None,
use_dynamic_left_chunk: bool=False, ):
"""
Args:
input_size (int): input dim, d_feature
output_size (int): dimension of attention, d_model
attention_heads (int): the number of heads of multi head attention
linear_units (int): the hidden units number of position-wise feed
forward
num_blocks (int): the number of encoder blocks
dropout_rate (float): dropout rate
attention_dropout_rate (float): dropout rate in attention
positional_dropout_rate (float): dropout rate after adding
positional encoding
input_layer (str): input layer type.
optional [linear, conv2d, conv2d6, conv2d8]
pos_enc_layer_type (str): Encoder positional encoding layer type.
optional [abs_pos, scaled_abs_pos, rel_pos]
normalize_before (bool):
True: use layer_norm before each sub-block of a layer.
False: use layer_norm after each sub-block of a layer.
concat_after (bool): whether to concat attention layer's input
and output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
static_chunk_size (int): chunk size for static chunk training and
decoding
use_dynamic_chunk (bool): whether to use dynamic chunk size for
training or not. You can only use a fixed chunk (chunk_size > 0)
or a dynamic chunk size (use_dynamic_chunk = True).
global_cmvn (Optional[paddle.nn.Layer]): Optional GlobalCMVN layer
use_dynamic_left_chunk (bool): whether use dynamic left chunk in
dynamic chunk training
"""
assert check_argument_types()
super().__init__()
self._output_size = output_size
if pos_enc_layer_type == "abs_pos":
pos_enc_class = PositionalEncoding
elif pos_enc_layer_type == "rel_pos":
pos_enc_class = RelPositionalEncoding
else:
raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
if input_layer == "linear":
subsampling_class = LinearNoSubsampling
elif input_layer == "conv2d":
subsampling_class = Conv2dSubsampling4
elif input_layer == "conv2d6":
subsampling_class = Conv2dSubsampling6
elif input_layer == "conv2d8":
subsampling_class = Conv2dSubsampling8
else:
raise ValueError("unknown input_layer: " + input_layer)
self.global_cmvn = global_cmvn
self.embed = subsampling_class(
idim=input_size,
odim=output_size,
dropout_rate=dropout_rate,
pos_enc_class=pos_enc_class(
d_model=output_size, dropout_rate=positional_dropout_rate), )
self.normalize_before = normalize_before
self.after_norm = nn.LayerNorm(output_size, epsilon=1e-12)
self.static_chunk_size = static_chunk_size
self.use_dynamic_chunk = use_dynamic_chunk
self.use_dynamic_left_chunk = use_dynamic_left_chunk
def output_size(self) -> int:
return self._output_size
def forward(
self,
xs: paddle.Tensor,
xs_lens: paddle.Tensor,
decoding_chunk_size: int=0,
num_decoding_left_chunks: int=-1,
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Embed positions in tensor.
Args:
xs: padded input tensor (B, L, D)
xs_lens: input length (B)
decoding_chunk_size: decoding chunk size for dynamic chunk
0: default for training, use random dynamic chunk.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
num_decoding_left_chunks: number of left chunks, this is for decoding,
the chunk size is decoding_chunk_size.
>=0: use num_decoding_left_chunks
<0: use all left chunks
Returns:
encoder output tensor, lens and mask
"""
masks = make_non_pad_mask(xs_lens).unsqueeze(1) # (B, 1, L)
if self.global_cmvn is not None:
xs = self.global_cmvn(xs)
#TODO(Hui Zhang): self.embed(xs, masks, offset=0), stride_slice not support bool tensor
xs, pos_emb, masks = self.embed(xs, masks.type_as(xs), offset=0)
#TODO(Hui Zhang): remove mask.astype, stride_slice not support bool tensor
masks = masks.astype(paddle.bool)
#TODO(Hui Zhang): mask_pad = ~masks
mask_pad = masks.logical_not()
chunk_masks = add_optional_chunk_mask(
xs, masks, self.use_dynamic_chunk, self.use_dynamic_left_chunk,
decoding_chunk_size, self.static_chunk_size,
num_decoding_left_chunks)
for layer in self.encoders:
xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
if self.normalize_before:
xs = self.after_norm(xs)
# Here we assume the mask is not changed in encoder layers, so just
# return the masks before encoder layers, and the masks will be used
# for cross attention with decoder later
return xs, masks
def forward_chunk(
self,
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None,
elayers_output_cache: Optional[List[paddle.Tensor]]=None,
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
""" Forward just one chunk
Args:
xs (paddle.Tensor): chunk input, [B=1, T, D]
offset (int): current offset in encoder output time stamp
required_cache_size (int): cache size required for next chunk
computation
>=0: actual cache size
<0: means all history cache is required
subsampling_cache (Optional[paddle.Tensor]): subsampling cache
elayers_output_cache (Optional[List[paddle.Tensor]]):
transformer/conformer encoder layers output cache
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
cnn cache
Returns:
paddle.Tensor: output of current input xs
paddle.Tensor: subsampling cache required for next chunk computation
List[paddle.Tensor]: encoder layers output cache required for next
chunk computation
List[paddle.Tensor]: conformer cnn cache
"""
assert xs.size(0) == 1 # batch size must be one
# tmp_masks is just for interface compatibility
tmp_masks = paddle.ones([1, xs.size(1)], dtype=paddle.bool)
tmp_masks = tmp_masks.unsqueeze(1) #[B=1, C=1, T]
if self.global_cmvn is not None:
xs = self.global_cmvn(xs)
xs, pos_emb, _ = self.embed(
xs, tmp_masks, offset=offset) #xs=(B, T, D), pos_emb=(B=1, T, D)
if subsampling_cache is not None:
cache_size = subsampling_cache.size(1) #T
xs = paddle.cat((subsampling_cache, xs), dim=1)
else:
cache_size = 0
pos_emb = self.embed.position_encoding(
offset=offset - cache_size, size=xs.size(1))
if required_cache_size < 0:
next_cache_start = 0
elif required_cache_size == 0:
next_cache_start = xs.size(1)
else:
next_cache_start = xs.size(1) - required_cache_size
r_subsampling_cache = xs[:, next_cache_start:, :]
# Real mask for transformer/conformer layers
masks = paddle.ones([1, xs.size(1)], dtype=paddle.bool)
masks = masks.unsqueeze(1) #[B=1, C=1, T]
r_elayers_output_cache = []
r_conformer_cnn_cache = []
for i, layer in enumerate(self.encoders):
attn_cache = None if elayers_output_cache is None else elayers_output_cache[
i]
cnn_cache = None if conformer_cnn_cache is None else conformer_cnn_cache[
i]
xs, _, new_cnn_cache = layer(
xs,
masks,
pos_emb,
output_cache=attn_cache,
cnn_cache=cnn_cache)
r_elayers_output_cache.append(xs[:, next_cache_start:, :])
r_conformer_cnn_cache.append(new_cnn_cache)
if self.normalize_before:
xs = self.after_norm(xs)
return (xs[:, cache_size:, :], r_subsampling_cache,
r_elayers_output_cache, r_conformer_cnn_cache)
def forward_chunk_by_chunk(
self,
xs: paddle.Tensor,
decoding_chunk_size: int,
num_decoding_left_chunks: int=-1,
) -> Tuple[paddle.Tensor, paddle.Tensor]:
""" Forward input chunk by chunk with chunk_size like a streaming
fashion
Here we should pay special attention to computation cache in the
streaming style forward chunk by chunk. Three things should be taken
into account for computation in the current network:
1. transformer/conformer encoder layers output cache
2. convolution in conformer
3. convolution in subsampling
However, we don't implement a subsampling cache because:
1. We can make the subsampling module output the right result by
overlapping the input instead of caching left context. This wastes
some computation, but subsampling accounts for only a very small
fraction of the computation in the whole model.
2. Typically there are several convolution layers with subsampling
in the subsampling module, so it is tricky and complicated to
cache across convolution layers with different subsampling
rates.
3. Currently nn.Sequential is used to stack all the convolution
layers in subsampling; we would need to rewrite it to make it
work with a cache, which is not preferred.
Args:
xs (paddle.Tensor): (1, max_len, dim)
decoding_chunk_size (int): decoding chunk size.
num_decoding_left_chunks (int): number of left chunks to use for decoding.
"""
assert decoding_chunk_size > 0
# The model is trained by static or dynamic chunk
assert self.static_chunk_size > 0 or self.use_dynamic_chunk
# feature stride and window for `subsampling` module
subsampling = self.embed.subsampling_rate
context = self.embed.right_context + 1 # Add current frame
stride = subsampling * decoding_chunk_size
decoding_window = (decoding_chunk_size - 1) * subsampling + context
num_frames = xs.size(1)
required_cache_size = decoding_chunk_size * num_decoding_left_chunks
subsampling_cache: Optional[paddle.Tensor] = None
elayers_output_cache: Optional[List[paddle.Tensor]] = None
conformer_cnn_cache: Optional[List[paddle.Tensor]] = None
outputs = []
offset = 0
# Feed forward overlap input step by step
for cur in range(0, num_frames - context + 1, stride):
end = min(cur + decoding_window, num_frames)
chunk_xs = xs[:, cur:end, :]
(y, subsampling_cache, elayers_output_cache,
conformer_cnn_cache) = self.forward_chunk(
chunk_xs, offset, required_cache_size, subsampling_cache,
elayers_output_cache, conformer_cnn_cache)
outputs.append(y)
offset += y.size(1)
ys = paddle.cat(outputs, 1)
# fake mask, just for jit script and compatibility with `forward` api
masks = paddle.ones([1, ys.size(1)], dtype=paddle.bool)
masks = masks.unsqueeze(1)
return ys, masks
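For orientation, a minimal streaming-style usage sketch of forward_chunk_by_chunk (illustration only, not part of this diff; `encoder` stands for any already constructed BaseEncoder subclass trained with static or dynamic chunks):
import paddle

feats = paddle.randn([1, 200, 80])   # (B=1, T, D): e.g. 80-dim fbank features

# 16 encoder frames per chunk, unlimited left context
ys, masks = encoder.forward_chunk_by_chunk(
    feats, decoding_chunk_size=16, num_decoding_left_chunks=-1)

# ys: (1, T', output_size) encoder output accumulated chunk by chunk
# masks: (1, 1, T') all-True mask kept for API compatibility with forward()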
class TransformerEncoder(BaseEncoder):
"""Transformer encoder module."""
def __init__(
self,
input_size: int,
output_size: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
pos_enc_layer_type: str="abs_pos",
normalize_before: bool=True,
concat_after: bool=False,
static_chunk_size: int=0,
use_dynamic_chunk: bool=False,
global_cmvn: nn.Layer=None,
use_dynamic_left_chunk: bool=False, ):
""" Construct TransformerEncoder
See Encoder for the meaning of each parameter.
"""
assert check_argument_types()
super().__init__(input_size, output_size, attention_heads, linear_units,
num_blocks, dropout_rate, positional_dropout_rate,
attention_dropout_rate, input_layer,
pos_enc_layer_type, normalize_before, concat_after,
static_chunk_size, use_dynamic_chunk, global_cmvn,
use_dynamic_left_chunk)
self.encoders = nn.ModuleList([
TransformerEncoderLayer(
size=output_size,
self_attn=MultiHeadedAttention(attention_heads, output_size,
attention_dropout_rate),
feed_forward=PositionwiseFeedForward(output_size, linear_units,
dropout_rate),
dropout_rate=dropout_rate,
normalize_before=normalize_before,
concat_after=concat_after) for _ in range(num_blocks)
])
class ConformerEncoder(BaseEncoder):
"""Conformer encoder module."""
def __init__(
self,
input_size: int,
output_size: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
pos_enc_layer_type: str="rel_pos",
normalize_before: bool=True,
concat_after: bool=False,
static_chunk_size: int=0,
use_dynamic_chunk: bool=False,
global_cmvn: nn.Layer=None,
use_dynamic_left_chunk: bool=False,
positionwise_conv_kernel_size: int=1,
macaron_style: bool=True,
selfattention_layer_type: str="rel_selfattn",
activation_type: str="swish",
use_cnn_module: bool=True,
cnn_module_kernel: int=15,
causal: bool=False,
cnn_module_norm: str="batch_norm", ):
"""Construct ConformerEncoder
Args:
input_size to use_dynamic_chunk: see BaseEncoder for their meaning.
positionwise_conv_kernel_size (int): Kernel size of positionwise
conv1d layer.
macaron_style (bool): Whether to use macaron style for
positionwise layer.
selfattention_layer_type (str): Encoder attention layer type.
This parameter currently has no effect; it is kept only for
configuration compatibility.
activation_type (str): Encoder activation function type.
use_cnn_module (bool): Whether to use convolution module.
cnn_module_kernel (int): Kernel size of convolution module.
causal (bool): whether to use causal convolution or not.
cnn_module_norm (str): norm type of the convolution module, one of ['batch_norm', 'layer_norm']
"""
assert check_argument_types()
super().__init__(input_size, output_size, attention_heads, linear_units,
num_blocks, dropout_rate, positional_dropout_rate,
attention_dropout_rate, input_layer,
pos_enc_layer_type, normalize_before, concat_after,
static_chunk_size, use_dynamic_chunk, global_cmvn,
use_dynamic_left_chunk)
activation = get_activation(activation_type)
# self-attention module definition
encoder_selfattn_layer = RelPositionMultiHeadedAttention
encoder_selfattn_layer_args = (attention_heads, output_size,
attention_dropout_rate)
# feed-forward module definition
positionwise_layer = PositionwiseFeedForward
positionwise_layer_args = (output_size, linear_units, dropout_rate,
activation)
# convolution module definition
convolution_layer = ConvolutionModule
convolution_layer_args = (output_size, cnn_module_kernel, activation,
cnn_module_norm, causal)
self.encoders = nn.ModuleList([
ConformerEncoderLayer(
size=output_size,
self_attn=encoder_selfattn_layer(*encoder_selfattn_layer_args),
feed_forward=positionwise_layer(*positionwise_layer_args),
feed_forward_macaron=positionwise_layer(
*positionwise_layer_args) if macaron_style else None,
conv_module=convolution_layer(*convolution_layer_args)
if use_cnn_module else None,
dropout_rate=dropout_rate,
normalize_before=normalize_before,
concat_after=concat_after) for _ in range(num_blocks)
])
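A hedged end-to-end sketch (not part of the PR) of constructing the encoder above and running a full-context forward pass; it assumes the other modules this file imports (attention, subsampling, the paddle compatibility hacks) are available:
import paddle

encoder = ConformerEncoder(input_size=80)      # defaults: d_model=256, 6 blocks

feats = paddle.randn([4, 120, 80])             # (B, T, D) padded fbank batch
feats_len = paddle.to_tensor([120, 96, 64, 48], dtype='int64')

xs, masks = encoder(feats, feats_len)          # full-context (non-streaming) forward
# xs: (4, ~T/4, 256) after Conv2dSubsampling4; masks: (4, 1, ~T/4)
print(xs.shape, encoder.output_size())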

@ -0,0 +1,284 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Encoder self-attention layer definition."""
from typing import Optional
from typing import Tuple
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["TransformerEncoderLayer", "ConformerEncoderLayer"]
class TransformerEncoderLayer(nn.Layer):
"""Encoder layer module."""
def __init__(
self,
size: int,
self_attn: nn.Layer,
feed_forward: nn.Layer,
dropout_rate: float,
normalize_before: bool=True,
concat_after: bool=False, ):
"""Construct an EncoderLayer object.
Args:
size (int): Input dimension.
self_attn (nn.Layer): Self-attention module instance.
`MultiHeadedAttention` or `RelPositionMultiHeadedAttention`
instance can be used as the argument.
feed_forward (nn.Layer): Feed-forward module instance.
`PositionwiseFeedForward`, instance can be used as the argument.
dropout_rate (float): Dropout rate.
normalize_before (bool):
True: use layer_norm before each sub-block.
False: to use layer_norm after each sub-block.
concat_after (bool): Whether to concat attention layer's input and
output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
super().__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.norm1 = nn.LayerNorm(size, epsilon=1e-12)
self.norm2 = nn.LayerNorm(size, epsilon=1e-12)
self.dropout = nn.Dropout(dropout_rate)
self.size = size
self.normalize_before = normalize_before
self.concat_after = concat_after
# concat_linear may not be used in the forward function,
# but it will still be saved in the checkpoint
self.concat_linear = nn.Linear(size + size, size)
def forward(
self,
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
output_cache: Optional[paddle.Tensor]=None,
cnn_cache: Optional[paddle.Tensor]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
x (paddle.Tensor): Input tensor (#batch, time, size).
mask (paddle.Tensor): Mask tensor for the input (#batch, time).
pos_emb (paddle.Tensor): just for interface compatibility
to ConformerEncoderLayer
mask_pad (paddle.Tensor): not used in the transformer layer;
kept only for a unified API with the conformer layer.
output_cache (paddle.Tensor): Cache tensor of the output
(#batch, time2, size), time2 < time in x.
cnn_cache (paddle.Tensor): not used here, it's for interface
compatibility to ConformerEncoderLayer
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time).
paddle.Tensor: Fake cnn cache tensor for api compatibility with Conformer (#batch, channels, time').
"""
residual = x
if self.normalize_before:
x = self.norm1(x)
if output_cache is None:
x_q = x
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
if self.concat_after:
x_concat = paddle.concat(
(x, self.self_attn(x_q, x, x, mask)), axis=-1)
x = residual + self.concat_linear(x_concat)
else:
x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
if not self.normalize_before:
x = self.norm1(x)
residual = x
if self.normalize_before:
x = self.norm2(x)
x = residual + self.dropout(self.feed_forward(x))
if not self.normalize_before:
x = self.norm2(x)
if output_cache is not None:
x = paddle.concat([output_cache, x], axis=1)
fake_cnn_cache = paddle.zeros([1], dtype=x.dtype)
return x, mask, fake_cnn_cache
class ConformerEncoderLayer(nn.Layer):
"""Encoder layer module."""
def __init__(
self,
size: int,
self_attn: nn.Layer,
feed_forward: Optional[nn.Layer]=None,
feed_forward_macaron: Optional[nn.Layer]=None,
conv_module: Optional[nn.Layer]=None,
dropout_rate: float=0.1,
normalize_before: bool=True,
concat_after: bool=False, ):
"""Construct an EncoderLayer object.
Args:
size (int): Input dimension.
self_attn (nn.Layer): Self-attention module instance.
`MultiHeadedAttention` or `RelPositionMultiHeadedAttention`
instance can be used as the argument.
feed_forward (nn.Layer): Feed-forward module instance.
`PositionwiseFeedForward` instance can be used as the argument.
feed_forward_macaron (nn.Layer): Additional feed-forward module
instance.
`PositionwiseFeedForward` instance can be used as the argument.
conv_module (nn.Layer): Convolution module instance.
`ConvolutionModule` instance can be used as the argument.
dropout_rate (float): Dropout rate.
normalize_before (bool):
True: use layer_norm before each sub-block.
False: use layer_norm after each sub-block.
concat_after (bool): Whether to concat attention layer's input and
output.
True: x -> x + linear(concat(x, att(x)))
False: x -> x + att(x)
"""
super().__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.feed_forward_macaron = feed_forward_macaron
self.conv_module = conv_module
self.norm_ff = nn.LayerNorm(size, epsilon=1e-12) # for the FNN module
self.norm_mha = nn.LayerNorm(size, epsilon=1e-12) # for the MHA module
if feed_forward_macaron is not None:
self.norm_ff_macaron = nn.LayerNorm(size, epsilon=1e-12)
self.ff_scale = 0.5
else:
self.ff_scale = 1.0
if self.conv_module is not None:
self.norm_conv = nn.LayerNorm(
size, epsilon=1e-12) # for the CNN module
self.norm_final = nn.LayerNorm(
size, epsilon=1e-12) # for the final output of the block
self.dropout = nn.Dropout(dropout_rate)
self.size = size
self.normalize_before = normalize_before
self.concat_after = concat_after
self.concat_linear = nn.Linear(size + size, size)
def forward(
self,
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
output_cache: Optional[paddle.Tensor]=None,
cnn_cache: Optional[paddle.Tensor]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
x (paddle.Tensor): (#batch, time, size)
mask (paddle.Tensor): Mask tensor for the input (#batch, time, time).
pos_emb (paddle.Tensor): positional encoding, must not be None
for ConformerEncoderLayer.
mask_pad (paddle.Tensor): batch padding mask used for conv module, (B, 1, T).
output_cache (paddle.Tensor): Cache tensor of the encoder output
(#batch, time2, size), time2 < time in x.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time).
paddle.Tensor: New cnn cache tensor (#batch, channels, time').
"""
# whether to use macaron style FFN
if self.feed_forward_macaron is not None:
residual = x
if self.normalize_before:
x = self.norm_ff_macaron(x)
x = residual + self.ff_scale * self.dropout(
self.feed_forward_macaron(x))
if not self.normalize_before:
x = self.norm_ff_macaron(x)
# multi-headed self-attention module
residual = x
if self.normalize_before:
x = self.norm_mha(x)
if output_cache is None:
x_q = x
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
x_att = self.self_attn(x_q, x, x, pos_emb, mask)
if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1)
x = residual + self.concat_linear(x_concat)
else:
x = residual + self.dropout(x_att)
if not self.normalize_before:
x = self.norm_mha(x)
# convolution module
# Fake new cnn cache here, and then change it in conv_module
new_cnn_cache = paddle.zeros([1], dtype=x.dtype)
if self.conv_module is not None:
residual = x
if self.normalize_before:
x = self.norm_conv(x)
x, new_cnn_cache = self.conv_module(x, mask_pad, cnn_cache)
x = residual + self.dropout(x)
if not self.normalize_before:
x = self.norm_conv(x)
# feed forward module
residual = x
if self.normalize_before:
x = self.norm_ff(x)
x = residual + self.ff_scale * self.dropout(self.feed_forward(x))
if not self.normalize_before:
x = self.norm_ff(x)
if self.conv_module is not None:
x = self.norm_final(x)
if output_cache is not None:
x = paddle.concat([output_cache, x], axis=1)
return x, mask, new_cnn_cache
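To summarize the pre-norm data flow of the conformer block above, here is a runnable schematic with identity stand-ins for the sub-modules (dropout omitted); it mirrors the ordering of the forward method, not the real layers:
import paddle

x = paddle.randn([2, 50, 256])
ff_macaron = attn = conv = ff = lambda t: t    # stand-ins for the real sub-modules
norm = paddle.nn.LayerNorm(256)

x = x + 0.5 * ff_macaron(norm(x))   # macaron feed-forward, half-scaled residual
x = x + attn(norm(x))               # multi-headed self-attention (with rel. pos. emb.)
x = x + conv(norm(x))               # convolution module
x = x + 0.5 * ff(norm(x))           # second feed-forward, half-scaled residual
x = norm(x)                         # final LayerNorm when a conv module is used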

@ -11,45 +11,15 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-import logging
import paddle
from paddle import nn
from paddle.nn import functional as F
-from paddle.nn import initializer as I
-logger = logging.getLogger(__name__)
-__all__ = ['CTCLoss']
-# TODO(Hui Zhang): remove this hack, when `norm_by_times=True` is added
-def ctc_loss(logits,
-labels,
-input_lengths,
-label_lengths,
-blank=0,
-reduction='mean',
-norm_by_times=True):
-#logger.info("my ctc loss with norm by times")
-## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
-loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times,
-input_lengths, label_lengths)
-loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
-logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ")
-assert reduction in ['mean', 'sum', 'none']
-if reduction == 'mean':
-loss_out = paddle.mean(loss_out / label_lengths)
-elif reduction == 'sum':
-loss_out = paddle.sum(loss_out)
-logger.info(f"ctc loss: {loss_out}")
-return loss_out
-# TODO(Hui Zhang): remove this hack
-F.ctc_loss = ctc_loss
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ['CTCLoss', "LabelSmoothingLoss"]
class CTCLoss(nn.Layer):
@ -76,8 +46,98 @@ class CTCLoss(nn.Layer):
# warp-ctc need activation with shape [T, B, V + 1]
# logits: (B, L, D) -> (L, B, D)
logits = logits.transpose([1, 0, 2])
# (TODO:Hui Zhang) ctc loss does not support int64 labels
ys_pad = ys_pad.astype(paddle.int32)
loss = self.loss(logits, ys_pad, hlens, ys_lens)
if self.batch_average:
# Batch-size average
loss = loss / B
return loss
class LabelSmoothingLoss(nn.Layer):
"""Label-smoothing loss.
In a standard CE loss, the label's data distribution is:
[0,1,2] ->
[
[1.0, 0.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, 1.0],
]
In the label-smoothed version of the CE loss, some probability mass
is taken from the true label (1.0) and distributed
among the other labels.
e.g.
smoothing=0.1
[0,1,2] ->
[
[0.9, 0.05, 0.05],
[0.05, 0.9, 0.05],
[0.05, 0.05, 0.9],
]
"""
def __init__(self,
size: int,
padding_idx: int,
smoothing: float,
normalize_length: bool=False):
"""Label-smoothing loss.
Args:
size (int): the number of class
padding_idx (int): padding class id which will be ignored for loss
smoothing (float): smoothing rate (0.0 means the conventional CE)
normalize_length (bool):
True, normalize loss by sequence length;
False, normalize loss by batch size.
Defaults to False.
"""
super().__init__()
self.size = size
self.padding_idx = padding_idx
self.smoothing = smoothing
self.confidence = 1.0 - smoothing
self.normalize_length = normalize_length
self.criterion = nn.KLDivLoss(reduction="none")
def forward(self, x: paddle.Tensor, target: paddle.Tensor) -> paddle.Tensor:
"""Compute loss between x and target.
The model output and label tensors are flattened to
(batch*seqlen, class) shape, and a mask is applied to the
padding positions so they do not contribute to the loss.
Args:
x (paddle.Tensor): prediction (batch, seqlen, class)
target (paddle.Tensor):
target signal masked with self.padding_id (batch, seqlen)
Returns:
loss (paddle.Tensor) : The KL loss, scalar float value
"""
B, T, D = paddle.shape(x)
assert D == self.size
x = x.reshape((-1, self.size))
target = target.reshape([-1])
# build true_dist with full_like instead of relying on no_grad(),
# since no_grad() can not be exported by JIT
true_dist = paddle.full_like(x, self.smoothing / (self.size - 1))
ignore = target == self.padding_idx # (B,)
# target = target * (1 - ignore) # avoid -1 index
target = target.masked_fill(ignore, 0) # avoid -1 index
# true_dist.scatter_(1, target.unsqueeze(1), self.confidence)
target_mask = F.one_hot(target, self.size)
true_dist *= (1 - target_mask)
true_dist += target_mask * self.confidence
kl = self.criterion(F.log_softmax(x, axis=1), true_dist)
#TODO(Hui Zhang): sum not support bool type
#total = len(target) - int(ignore.sum())
total = len(target) - int(ignore.type_as(target).sum())
denom = total if self.normalize_length else B
#numer = (kl * (1 - ignore)).sum()
numer = kl.masked_fill(ignore.unsqueeze(1), 0).sum()
return numer / denom
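A tiny worked example (illustrative only) of the smoothing above with size=3, smoothing=0.1: each wrong class receives 0.1 / (3 - 1) = 0.05 and the true class keeps 0.9. The call below assumes the tensor helpers used in forward (masked_fill, one_hot) are available in the running paddle build or via the compatibility hacks added elsewhere in this PR:
import paddle

loss_fn = LabelSmoothingLoss(size=3, padding_idx=-1, smoothing=0.1)

x = paddle.randn([1, 2, 3])                           # (B=1, T=2, V=3) logits
target = paddle.to_tensor([[0, -1]], dtype='int64')   # second position is padding

loss = loss_fn(x, target)   # KL(true_dist || softmax(x)), padding masked out, / B
print(float(loss))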

@ -11,20 +11,37 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-import logging
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
-logger = logging.getLogger(__name__)
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
-__all__ = ['sequence_mask']
__all__ = [
'sequence_mask', "make_pad_mask", "make_non_pad_mask", "subsequent_mask",
"subsequent_chunk_mask", "add_optional_chunk_mask", "mask_finished_scores",
"mask_finished_preds"
]
def sequence_mask(x_len, max_len=None, dtype='float32'):
"""batch sequence mask.
Args:
x_len (paddle.Tensor): sequence lengths, [B]
max_len (int, optional): max sequence length. Defaults to None.
dtype (str, optional): mask data type. Defaults to 'float32'.
Returns:
paddle.Tensor: [B, Tmax]
Examples:
>>> sequence_mask([2, 4])
[[1., 1., 0., 0.],
[1., 1., 1., 1.]]
"""
# (TODO: Hui Zhang): jit does not support Tensor.dim() and Tensor.ndim
# assert x_len.dim() == 1, (x_len.dim(), x_len)
max_len = max_len or x_len.max()
x_len = paddle.unsqueeze(x_len, -1)
row_vector = paddle.arange(max_len)
@ -33,3 +50,236 @@ def sequence_mask(x_len, max_len=None, dtype='float32'):
mask = row_vector > x_len  # a bug: the broadcast goes wrong here
mask = paddle.cast(mask, dtype)
return mask
def make_pad_mask(lengths: paddle.Tensor) -> paddle.Tensor:
"""Make mask tensor containing indices of padded part.
See description of make_non_pad_mask.
Args:
lengths (paddle.Tensor): Batch of lengths (B,).
Returns:
paddle.Tensor: Mask tensor containing indices of padded part.
Examples:
>>> lengths = [5, 3, 2]
>>> make_pad_mask(lengths)
masks = [[0, 0, 0, 0 ,0],
[0, 0, 0, 1, 1],
[0, 0, 1, 1, 1]]
"""
assert lengths.dim() == 1
batch_size = int(lengths.shape[0])
max_len = int(lengths.max())
seq_range = paddle.arange(0, max_len, dtype=paddle.int64)
seq_range_expand = seq_range.unsqueeze(0).expand([batch_size, max_len])
seq_length_expand = lengths.unsqueeze(-1)
mask = seq_range_expand >= seq_length_expand
return mask
def make_non_pad_mask(lengths: paddle.Tensor) -> paddle.Tensor:
"""Make mask tensor containing indices of non-padded part.
The sequences in a batch may have different lengths. To enable
batch computing, padding is needed to make all sequences the same
size. To prevent the padding part from passing values into
context-dependent blocks such as attention or convolution, this
padding part is masked.
This pad_mask is used in both encoder and decoder.
1 for non-padded part and 0 for padded part.
Args:
lengths (paddle.Tensor): Batch of lengths (B,).
Returns:
paddle.Tensor: mask tensor containing indices of padded part.
Examples:
>>> lengths = [5, 3, 2]
>>> make_non_pad_mask(lengths)
masks = [[1, 1, 1, 1 ,1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0]]
"""
#TODO(Hui Zhang): return ~make_pad_mask(lengths), not support ~
return make_pad_mask(lengths).logical_not()
def subsequent_mask(size: int) -> paddle.Tensor:
"""Create mask for subsequent steps (size, size).
This mask is used only in the decoder, which works in an auto-regressive mode:
the current step may only attend to the steps on its left.
In the encoder, full attention is used when streaming is not necessary and
the sequence is not long. In that case, no attention mask is needed.
When streaming is needed, chunk-based attention is used in the encoder. See
subsequent_chunk_mask for the chunk-based attention mask.
Args:
size (int): size of mask
Returns:
paddle.Tensor: mask, [size, size]
Examples:
>>> subsequent_mask(3)
[[1, 0, 0],
[1, 1, 0],
[1, 1, 1]]
"""
ret = paddle.ones([size, size], dtype=paddle.bool)
#TODO(Hui Zhang): tril not support bool
#return paddle.tril(ret)
ret = ret.astype(paddle.float)
ret = paddle.tril(ret)
ret = ret.astype(paddle.bool)
return ret
def subsequent_chunk_mask(
size: int,
chunk_size: int,
num_left_chunks: int=-1, ) -> paddle.Tensor:
"""Create mask for subsequent steps (size, size) with chunk size,
this is for streaming encoder
Args:
size (int): size of mask
chunk_size (int): size of chunk
num_left_chunks (int): number of left chunks
<0: use full chunk
>=0: use num_left_chunks
Returns:
paddle.Tensor: mask, [size, size]
Examples:
>>> subsequent_chunk_mask(4, 2)
[[1, 1, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 1],
[1, 1, 1, 1]]
"""
ret = paddle.zeros([size, size], dtype=paddle.bool)
for i in range(size):
if num_left_chunks < 0:
start = 0
else:
start = max(0, (i // chunk_size - num_left_chunks) * chunk_size)
ending = min(size, (i // chunk_size + 1) * chunk_size)
ret[i, start:ending] = True
return ret
def add_optional_chunk_mask(xs: paddle.Tensor,
masks: paddle.Tensor,
use_dynamic_chunk: bool,
use_dynamic_left_chunk: bool,
decoding_chunk_size: int,
static_chunk_size: int,
num_decoding_left_chunks: int):
""" Apply optional mask for encoder.
Args:
xs (paddle.Tensor): padded input, (B, L, D), L for max length
mask (paddle.Tensor): mask for xs, (B, 1, L)
use_dynamic_chunk (bool): whether to use dynamic chunk or not
use_dynamic_left_chunk (bool): whether to use dynamic left chunk for
training.
decoding_chunk_size (int): decoding chunk size for dynamic chunk, it's
0: default for training, use random dynamic chunk.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
static_chunk_size (int): chunk size for static chunk training/decoding
if it's greater than 0, if use_dynamic_chunk is true,
this parameter will be ignored
num_decoding_left_chunks (int): number of left chunks, this is for decoding,
the chunk size is decoding_chunk_size.
>=0: use num_decoding_left_chunks
<0: use all left chunks
Returns:
paddle.Tensor: chunk mask of the input xs.
"""
# Whether to use chunk mask or not
if use_dynamic_chunk:
max_len = xs.shape[1]
if decoding_chunk_size < 0:
chunk_size = max_len
num_left_chunks = -1
elif decoding_chunk_size > 0:
chunk_size = decoding_chunk_size
num_left_chunks = num_decoding_left_chunks
else:
# chunk size is either [1, 25] or full context(max_len).
# Since we use 4 times subsampling and allow up to 1s(100 frames)
# delay, the maximum frame is 100 / 4 = 25.
chunk_size = int(paddle.randint(1, max_len, (1, )))
num_left_chunks = -1
if chunk_size > max_len // 2:
chunk_size = max_len
else:
chunk_size = chunk_size % 25 + 1
if use_dynamic_left_chunk:
max_left_chunks = (max_len - 1) // chunk_size
num_left_chunks = int(
paddle.randint(0, max_left_chunks, (1, )))
chunk_masks = subsequent_chunk_mask(xs.shape[1], chunk_size,
num_left_chunks) # (L, L)
chunk_masks = chunk_masks.unsqueeze(0) # (1, L, L)
chunk_masks = masks & chunk_masks # (B, L, L)
elif static_chunk_size > 0:
num_left_chunks = num_decoding_left_chunks
chunk_masks = subsequent_chunk_mask(xs.shape[1], static_chunk_size,
num_left_chunks) # (L, L)
chunk_masks = chunk_masks.unsqueeze(0) # (1, L, L)
chunk_masks = masks & chunk_masks # (B, L, L)
else:
chunk_masks = masks
return chunk_masks
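A small illustration (not part of the diff) of combining the chunk mask with the padding mask for a static chunk size of 2:
import paddle

lengths = paddle.to_tensor([4, 2], dtype='int64')
masks = make_non_pad_mask(lengths).unsqueeze(1)   # (B=2, 1, L=4)
xs = paddle.zeros([2, 4, 8])                      # dummy features, only shape is used

chunk_masks = add_optional_chunk_mask(
    xs, masks,
    use_dynamic_chunk=False, use_dynamic_left_chunk=False,
    decoding_chunk_size=0, static_chunk_size=2,
    num_decoding_left_chunks=-1)
# chunk_masks[b, i, j] is True if position i may attend to position j:
# the full history plus the current 2-frame chunk, cut off by each utterance length
print(chunk_masks.astype('int32'))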
def mask_finished_scores(score: paddle.Tensor,
flag: paddle.Tensor) -> paddle.Tensor:
"""
If a sequence is finished, we only allow one alive branch. This function
aims to give one branch a zero score and the rest -inf score.
Args:
score (paddle.Tensor): A real value array with shape
(batch_size * beam_size, beam_size).
flag (paddle.Tensor): A bool array with shape
(batch_size * beam_size, 1).
Returns:
paddle.Tensor: (batch_size * beam_size, beam_size).
Examples:
flag: tensor([[ True],
[False]])
score: tensor([[-0.3666, -0.6664, 0.6019],
[-1.1490, -0.2948, 0.7460]])
unfinished: tensor([[False, True, True],
[False, False, False]])
finished: tensor([[ True, False, False],
[False, False, False]])
return: tensor([[ 0.0000, -inf, -inf],
[-1.1490, -0.2948, 0.7460]])
"""
beam_size = score.shape[-1]
zero_mask = paddle.zeros_like(flag, dtype=paddle.bool)
if beam_size > 1:
unfinished = paddle.concat(
(zero_mask, flag.tile([1, beam_size - 1])), axis=1)
finished = paddle.concat(
(flag, zero_mask.tile([1, beam_size - 1])), axis=1)
else:
unfinished = zero_mask
finished = flag
# infs = paddle.ones_like(score) * -float('inf')
# score = paddle.where(unfinished, infs, score)
# score = paddle.where(finished, paddle.zeros_like(score), score)
score.masked_fill_(unfinished, -float('inf'))
score.masked_fill_(finished, 0)
return score
def mask_finished_preds(pred: paddle.Tensor, flag: paddle.Tensor,
eos: int) -> paddle.Tensor:
"""
If a sequence is finished, all of its branches should be <eos>.
Args:
pred (paddle.Tensor): An int array with shape
(batch_size * beam_size, beam_size).
flag (paddle.Tensor): A bool array with shape
(batch_size * beam_size, 1).
Returns:
paddle.Tensor: (batch_size * beam_size).
"""
beam_size = pred.shape[-1]
finished = flag.repeat(1, beam_size)
return pred.masked_fill_(finished, eos)
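A hedged sketch of how the two helpers above are typically used inside one beam-search expansion step (the names `logp` and `end_flag` are illustrative, and the in-place masked_fill_ calls rely on the tensor hacks added elsewhere in this PR):
import paddle

beam_size, eos = 3, 2
logp = paddle.to_tensor([[-0.4, -0.7, -1.2],
                         [-0.3, -0.9, -1.5]])        # (batch*beam, beam) scores
end_flag = paddle.to_tensor([[True], [False]])       # (batch*beam, 1) finished flags

logp = mask_finished_scores(logp, end_flag)          # finished row: one 0.0, rest -inf
top_ids = paddle.argsort(logp, axis=-1, descending=True)[:, :beam_size]
top_ids = mask_finished_preds(top_ids, end_flag, eos)  # finished row keeps <eos>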

@ -0,0 +1,57 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Positionwise feed forward layer definition."""
import paddle
from paddle import nn
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = ["PositionwiseFeedForward"]
class PositionwiseFeedForward(nn.Layer):
"""Positionwise feed forward layer."""
def __init__(self,
idim: int,
hidden_units: int,
dropout_rate: float,
activation: nn.Layer=nn.ReLU()):
"""Construct a PositionwiseFeedForward object.
The feed-forward layer is applied to each position of the sequence.
The output dimension is the same as the input dimension.
Args:
idim (int): Input dimension.
hidden_units (int): The number of hidden units.
dropout_rate (float): Dropout rate.
activation (paddle.nn.Layer): Activation function
"""
super().__init__()
self.w_1 = nn.Linear(idim, hidden_units)
self.activation = activation
self.dropout = nn.Dropout(dropout_rate)
self.w_2 = nn.Linear(hidden_units, idim)
def forward(self, xs: paddle.Tensor) -> paddle.Tensor:
"""Forward function.
Args:
xs: input tensor (B, Lmax, D)
Returns:
output tensor, (B, Lmax, D)
"""
return self.w_2(self.dropout(self.activation(self.w_1(xs))))
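A quick shape-check sketch (illustration only):
import paddle

ffn = PositionwiseFeedForward(idim=256, hidden_units=2048, dropout_rate=0.1)
xs = paddle.randn([4, 50, 256])    # (B, Lmax, D)
ys = ffn(xs)
assert ys.shape == xs.shape        # applied position-wise, so the shape is unchanged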

@ -11,19 +11,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
-import logging
import paddle
from paddle import nn
from paddle.nn import functional as F
from paddle.nn import initializer as I
-from deepspeech.modules.mask import sequence_mask
from deepspeech.modules.activation import brelu
from deepspeech.modules.mask import sequence_mask
from deepspeech.utils.log import Log
-logger = logging.getLogger(__name__)
logger = Log(__name__).getlog()
__all__ = ['RNNStack']
@ -41,7 +40,7 @@ class RNNCell(nn.RNNCellBase):
"""
def __init__(self,
-hidden_size,
hidden_size: int,
activation="tanh",
weight_ih_attr=None,
weight_hh_attr=None,
@ -108,8 +107,8 @@ class GRUCell(nn.RNNCellBase):
"""
def __init__(self,
-input_size,
-hidden_size,
input_size: int,
hidden_size: int,
weight_ih_attr=None,
weight_hh_attr=None,
bias_ih_attr=None,
@ -132,7 +131,6 @@ class GRUCell(nn.RNNCellBase):
self.input_size = input_size
self._gate_activation = F.sigmoid
self._activation = paddle.tanh
-#self._activation = F.relu
def forward(self, inputs, states=None):
if states is None:
@ -171,8 +169,6 @@ class BiRNNWithBN(nn.Layer):
"""Bidirectonal simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
-:param name: Name of the layer parameters.
-:type name: string
:param size: Dimension of RNN cells.
:type size: int
:param share_weights: Whether to share input-hidden weights between
@ -182,7 +178,7 @@
:rtype: Variable
"""
-def __init__(self, i_size, h_size, share_weights):
def __init__(self, i_size: int, h_size: int, share_weights: bool):
super().__init__()
self.share_weights = share_weights
if self.share_weights:
@ -208,7 +204,7 @@
self.bw_rnn = nn.RNN(
self.fw_cell, is_reverse=True, time_major=False)  #[B, T, D]
-def forward(self, x, x_len):
def forward(self, x: paddle.Tensor, x_len: paddle.Tensor):
# x, shape [B, T, D]
fw_x = self.fw_bn(self.fw_fc(x))
bw_x = self.bw_bn(self.bw_fc(x))
@ -234,7 +230,7 @@ class BiGRUWithBN(nn.Layer):
:rtype: Variable
"""
-def __init__(self, i_size, h_size, act):
def __init__(self, i_size: int, h_size: int):
super().__init__()
hidden_size = h_size * 3
@ -281,23 +277,29 @@ class RNNStack(nn.Layer):
:rtype: Variable
"""
-def __init__(self, i_size, h_size, num_stacks, use_gru, share_rnn_weights):
def __init__(self,
i_size: int,
h_size: int,
num_stacks: int,
use_gru: bool,
share_rnn_weights: bool):
super().__init__()
-self.rnn_stacks = nn.LayerList()
rnn_stacks = []
for i in range(num_stacks):
if use_gru:
#default:GRU using tanh
-self.rnn_stacks.append(
-BiGRUWithBN(i_size=i_size, h_size=h_size, act="relu"))
rnn_stacks.append(BiGRUWithBN(i_size=i_size, h_size=h_size))
else:
-self.rnn_stacks.append(
rnn_stacks.append(
BiRNNWithBN(
i_size=i_size,
h_size=h_size,
share_weights=share_rnn_weights))
i_size = h_size * 2
self.rnn_stacks = nn.ModuleList(rnn_stacks)
-def forward(self, x, x_len):
def forward(self, x: paddle.Tensor, x_len: paddle.Tensor):
"""
x: shape [B, T, D]
x_len: shape [B]
@ -0,0 +1,239 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Subsampling layer definition."""
from typing import Tuple
import paddle
from paddle import nn
from deepspeech.modules.embedding import PositionalEncoding
from deepspeech.utils.log import Log
logger = Log(__name__).getlog()
__all__ = [
"LinearNoSubsampling", "Conv2dSubsampling4", "Conv2dSubsampling6",
"Conv2dSubsampling8"
]
class BaseSubsampling(nn.Layer):
def __init__(self, pos_enc_class: nn.Layer=PositionalEncoding):
super().__init__()
self.pos_enc = pos_enc_class
# window size = (1 + right_context) + (chunk_size -1) * subsampling_rate
self.right_context = 0
# stride = subsampling_rate * chunk_size
self.subsampling_rate = 1
def position_encoding(self, offset: int, size: int) -> paddle.Tensor:
return self.pos_enc.position_encoding(offset, size)
class LinearNoSubsampling(BaseSubsampling):
"""Linear transform the input without subsampling."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an linear object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
pos_enc_class (PositionalEncoding): position encoding class
"""
super().__init__(pos_enc_class)
self.out = nn.Sequential(
nn.Linear(idim, odim),
nn.LayerNorm(odim, epsilon=1e-12),
nn.Dropout(dropout_rate), )
self.right_context = 0
self.subsampling_rate = 1
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Input x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: linear input tensor (#batch, time', odim),
where time' = time .
paddle.Tensor: positional encoding
paddle.Tensor: linear input mask (#batch, 1, time'),
where time' = time .
"""
x = self.out(x)
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask
class Conv2dSubsampling4(BaseSubsampling):
"""Convolutional 2D subsampling (to 1/4 length)."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an Conv2dSubsampling4 object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
"""
super().__init__(pos_enc_class)
self.conv = nn.Sequential(
nn.Conv2D(1, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 3, 2),
nn.ReLU(), )
self.out = nn.Sequential(
nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim))
self.subsampling_rate = 4
# The right context for every conv layer is computed by:
# (kernel_size - 1) / 2 * stride * frame_rate_of_this_layer
# 6 = (3 - 1) / 2 * 2 * 1 + (3 - 1) / 2 * 2 * 2
self.right_context = 6
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Subsample x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: Subsampled tensor (#batch, time', odim),
where time' = time // 4.
paddle.Tensor: positional encoding
paddle.Tensor: Subsampled mask (#batch, 1, time'),
where time' = time // 4.
"""
x = x.unsqueeze(1) # (b, c=1, t, f)
x = self.conv(x)
b, c, t, f = paddle.shape(x)
x = self.out(x.transpose([0, 2, 1, 3]).reshape([b, t, c * f]))
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-2:2]
class Conv2dSubsampling6(BaseSubsampling):
"""Convolutional 2D subsampling (to 1/6 length)."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an Conv2dSubsampling6 object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
pos_enc (PositionalEncoding): Custom position encoding layer.
"""
super().__init__(pos_enc_class)
self.conv = nn.Sequential(
nn.Conv2D(1, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 5, 3),
nn.ReLU(), )
# O = (I - F + Pstart + Pend) // S + 1
# when padding == 0: O = (I - F) // S + 1
self.linear = nn.Linear(odim * (((idim - 1) // 2 - 2) // 3), odim)
# The right context for every conv layer is computed by:
# (kernel_size - 1) / 2 * stride * frame_rate_of_this_layer
# 14 = (3 - 1) / 2 * 2 * 1 + (5 - 1) / 2 * 3 * 2
self.subsampling_rate = 6
self.right_context = 14
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Subsample x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: Subsampled tensor (#batch, time', odim),
where time' = time // 6.
paddle.Tensor: positional encoding
paddle.Tensor: Subsampled mask (#batch, 1, time'),
where time' = time // 6.
"""
x = x.unsqueeze(1) # (b, c, t, f)
x = self.conv(x)
b, c, t, f = paddle.shape(x)
x = self.linear(x.transpose([0, 2, 1, 3]).reshape([b, t, c * f]))
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-4:3]
class Conv2dSubsampling8(BaseSubsampling):
"""Convolutional 2D subsampling (to 1/8 length)."""
def __init__(self,
idim: int,
odim: int,
dropout_rate: float,
pos_enc_class: nn.Layer=PositionalEncoding):
"""Construct an Conv2dSubsampling8 object.
Args:
idim (int): Input dimension.
odim (int): Output dimension.
dropout_rate (float): Dropout rate.
"""
super().__init__(pos_enc_class)
self.conv = nn.Sequential(
nn.Conv2D(1, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 3, 2),
nn.ReLU(), )
self.linear = nn.Linear(odim * ((((idim - 1) // 2 - 1) // 2 - 1) // 2),
odim)
self.subsampling_rate = 8
# The right context for every conv layer is computed by:
# (kernel_size - 1) / 2 * stride * frame_rate_of_this_layer
# 14 = (3 - 1) / 2 * 2 * 1 + (3 - 1) / 2 * 2 * 2 + (3 - 1) / 2 * 2 * 4
self.right_context = 14
def forward(self, x: paddle.Tensor, x_mask: paddle.Tensor, offset: int=0
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Subsample x.
Args:
x (paddle.Tensor): Input tensor (#batch, time, idim).
x_mask (paddle.Tensor): Input mask (#batch, 1, time).
offset (int): position encoding offset.
Returns:
paddle.Tensor: Subsampled tensor (#batch, time', odim),
where time' = time // 8.
paddle.Tensor: positional encoding
paddle.Tensor: Subsampled mask (#batch, 1, time'),
where time' = time // 8.
"""
x = x.unsqueeze(1) # (b, c, t, f)
x = self.conv(x)
b, c, t, f = paddle.shape(x)
x = self.linear(x.transpose([0, 2, 1, 3]).reshape([b, t, c * f]))
x, pos_emb = self.pos_enc(x, offset)
return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-2:2][:, :, :-2:2]
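The subsampling_rate and right_context attributes above drive the streaming window arithmetic in BaseEncoder.forward_chunk_by_chunk; a small sanity check with the Conv2dSubsampling4 numbers (illustrative values):
# Conv2dSubsampling4: 4x time reduction, 6 frames of right context
subsampling, right_context = 4, 6
decoding_chunk_size = 16                  # encoder frames produced per chunk

context = right_context + 1               # plus the current frame
stride = subsampling * decoding_chunk_size
decoding_window = (decoding_chunk_size - 1) * subsampling + context

print(stride, decoding_window)            # 64 input frames advanced, 67 consumed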

@ -11,5 +11,3 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-from deepspeech.training.trainer import *

@ -11,7 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
@ -57,13 +56,19 @@ def default_argument_parser():
# save jit model to
parser.add_argument("--export_path", type=str, help="path of the jit model to save")
# save asr result to
parser.add_argument("--result_file", type=str, help="path to save the asr result")
# running
-parser.add_argument("--device", type=str, default='gpu', choices=["cpu", "gpu"], help="device type to use, cpu and gpu are supported.")
parser.add_argument("--device", type=str, default='gpu', choices=["cpu", "gpu"],
help="device type to use, cpu and gpu are supported.")
parser.add_argument("--nprocs", type=int, default=1, help="number of parallel processes to use.")
# overwrite extra config and default config
-#parser.add_argument("--opts", nargs=argparse.REMAINDER, help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
-parser.add_argument("--opts", type=str, default=[], nargs='+', help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
# parser.add_argument("--opts", nargs=argparse.REMAINDER,
#                     help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
parser.add_argument("--opts", type=str, default=[], nargs='+',
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
# yapf: enable
return parser

@ -11,18 +11,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-import logging
import paddle
-from paddle.fluid.dygraph import base as imperative_base
-from paddle.fluid import layers
from paddle.fluid import core
from paddle.fluid import layers
from paddle.fluid.dygraph import base as imperative_base
-logger = logging.getLogger(__name__)
from deepspeech.utils.log import Log
__all__ = ["ClipGradByGlobalNormWithLog"]
logger = Log(__name__).getlog()
-class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
class ClipGradByGlobalNormWithLog(paddle.nn.ClipGradByGlobalNorm):
def __init__(self, clip_norm):
super().__init__(clip_norm)
@ -41,11 +42,11 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
square = layers.square(merge_grad)
sum_square = layers.reduce_sum(square)
-logger.info(
-f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
-)
sum_square_list.append(sum_square)
# debug log
# logger.debug(f"Grad Before Clip: {p.name}: {float(sum_square.sqrt()) }")
# all parameters have been filtered out
if len(sum_square_list) == 0:
return params_grads
@ -53,7 +54,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
global_norm_var = layers.concat(sum_square_list)
global_norm_var = layers.reduce_sum(global_norm_var)
global_norm_var = layers.sqrt(global_norm_var)
-logger.info(f"Grad Global Norm: {float(global_norm_var)}!!!!")
# debug log
logger.debug(f"Grad Global Norm: {float(global_norm_var)}!!!!")
max_global_norm = layers.fill_constant(
shape=[1], dtype=global_norm_var.dtype, value=self.clip_norm)
clip_var = layers.elementwise_div(
@ -66,9 +69,11 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
params_and_grads.append((p, g))
continue
new_grad = layers.elementwise_mul(x=g, y=clip_var)
-logger.info(
-f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
-)
params_and_grads.append((p, new_grad))
# debug log
# logger.debug(
# f"Grad After Clip: {p.name}: {float(merge_grad.square().sum().sqrt())}"
# )
return params_and_grads
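A usage sketch (not part of this diff) wiring the clip object into a Paddle optimizer:
import paddle

model = paddle.nn.Linear(16, 16)                 # stand-in for the real model
clip = ClipGradByGlobalNormWithLog(clip_norm=5.0)
optimizer = paddle.optimizer.Adam(
    learning_rate=1e-3,
    parameters=model.parameters(),
    grad_clip=clip)                              # global-norm clipping with debug logs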

@ -0,0 +1,66 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Union
from paddle.optimizer.lr import LRScheduler
from typeguard import check_argument_types
from deepspeech.utils.log import Log
__all__ = ["WarmupLR"]
logger = Log(__name__).getlog()
class WarmupLR(LRScheduler):
"""The WarmupLR scheduler
This scheduler is almost the same as the NoamLR scheduler except for the
following difference:
NoamLR:
lr = optimizer.lr * model_size ** -0.5
* min(step ** -0.5, step * warmup_step ** -1.5)
WarmupLR:
lr = optimizer.lr * warmup_step ** 0.5
* min(step ** -0.5, step * warmup_step ** -1.5)
Note that the maximum lr equals optimizer.lr in this scheduler.
"""
def __init__(self,
warmup_steps: Union[int, float]=25000,
learning_rate=1.0,
last_epoch=-1,
verbose=False):
assert check_argument_types()
self.warmup_steps = warmup_steps
super().__init__(learning_rate, last_epoch, verbose)
def __repr__(self):
return f"{self.__class__.__name__}(warmup_steps={self.warmup_steps})"
def get_lr(self):
step_num = self.last_epoch + 1
return self.base_lr * self.warmup_steps**0.5 * min(
step_num**-0.5, step_num * self.warmup_steps**-1.5)
def set_step(self, step: int=None):
'''
It will update the learning rate in the optimizer according to the current ``step``.
The new learning rate will take effect on the next ``optimizer.step`` call.
Args:
step (int, None): specify the current step. Default: None, auto-increment from last_epoch=-1.
Returns:
None
'''
self.step(epoch=step)
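A quick numeric check of the schedule (pure formula, illustrative values):
base_lr, warmup = 0.002, 25000
for step in (1, 12500, 25000, 100000):
    lr = base_lr * warmup**0.5 * min(step**-0.5, step * warmup**-1.5)
    print(step, round(lr, 8))
# the lr rises linearly to base_lr at step == warmup_steps, then decays as step**-0.5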
