metadata
tags:
- speech-recognition
- ASR
- k2
- sherpa
- PyTorch
license: cc-by-4.0
library_name: icefall
datasets:
- librispeech
inference: false
Create your own virtualenv
Install CUDA and cuDNN
- Run the following command:
nvidia-smi | head -n 4
Install a CUDA toolkit whose version is no newer than the "CUDA Version" shown in that output.
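The version comparison can be sketched in Python. This is a minimal sketch, assuming the `nvidia-smi` header contains a `CUDA Version: X.Y` field; the helper names `driver_cuda_version` and `toolkit_is_supported` and the sample header string are illustrative and not part of any NVIDIA tool.

```python
import re


def driver_cuda_version(smi_output: str) -> tuple:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def toolkit_is_supported(toolkit: str, smi_output: str) -> bool:
    """True if the toolkit version is no newer than what the driver reports."""
    major, minor = (int(part) for part in toolkit.split(".")[:2])
    return (major, minor) <= driver_cuda_version(smi_output)


# Illustrative header line; real text comes from `nvidia-smi | head -n 4`.
sample = "| NVIDIA-SMI 530.30.02    Driver Version: 530.30.02    CUDA Version: 12.1     |"
print(toolkit_is_supported("12.1", sample))  # True
print(toolkit_is_supported("12.4", sample))  # False
```

With a driver that reports 12.1, the 12.1 toolkit used in this guide passes the check, while a 12.4 toolkit would not.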
- Install CUDA (I am installing CUDA 12.1)
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
chmod +x cuda_12.1.0_530.30.02_linux.run
Run the installer (change --installpath to a directory you can write to):
./cuda_12.1.0_530.30.02_linux.run \
--silent \
--toolkit \
--installpath=/speech/hasan/software/cuda-12.1.0 \
--no-opengl-libs \
--no-drm \
--no-man-page
Install cuDNN for CUDA 12.1
wget https://huggingface.co/csukuangfj/cudnn/resolve/main/cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz
tar xvf cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz --strip-components=1 -C /speech/hasan/software/cuda-12.1.0
Create a file named activate-cuda-12.1.sh, copy the following code into it, and then run source activate-cuda-12.1.sh:
export CUDA_HOME=/speech/hasan/software/cuda-12.1.0
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export CUDAToolkit_ROOT_DIR=$CUDA_HOME
export CUDAToolkit_ROOT=$CUDA_HOME
export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
export CUDA_TOOLKIT_ROOT=$CUDA_HOME
export CUDA_BIN_PATH=$CUDA_HOME
export CUDA_PATH=$CUDA_HOME
export CUDA_INC_PATH=$CUDA_HOME/targets/x86_64-linux
export CFLAGS="-I$CUDA_HOME/targets/x86_64-linux/include $CFLAGS"
export CUDAToolkit_TARGET_DIR=$CUDA_HOME/targets/x86_64-linux
Check your installation by running:
which nvcc
Desired output:
/speech/hasan/software/cuda-12.1.0/bin/nvcc
nvcc --version
Desired output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
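The same check can be scripted. The sketch below parses the `release X.Y` field from the `nvcc --version` text shown above and derives the `cu121`-style suffix that the PyTorch wheel in the next step carries. The helper names `nvcc_release` and `cuda_tag` are hypothetical.

```python
import re

NVCC_SAMPLE = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
"""


def nvcc_release(version_output: str) -> str:
    """Return the 'release X.Y' value from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", version_output)
    if match is None:
        raise ValueError("could not find a release number")
    return match.group(1)


def cuda_tag(release: str) -> str:
    """Turn '12.1' into the 'cu121' suffix used by PyTorch wheels."""
    return "cu" + release.replace(".", "")


print(nvcc_release(NVCC_SAMPLE))            # 12.1
print(cuda_tag(nvcc_release(NVCC_SAMPLE)))  # cu121
```

In a live environment the text would come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` instead of the embedded sample.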
Install Torch and TorchAudio
torch==2.2.1 and torchaudio==2.2.1 are compatible with each other, so I'll install those:
pip install torch==2.2.1+cu121 torchaudio==2.2.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html
Verify Installation
python3 -c "import torch; print(torch.__version__)"
python3 -c "import torchaudio; print(torchaudio.__version__)"
Desired output (for both commands):
2.2.1+cu121
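A small helper can confirm that the two packages were installed as a matching pair. This is a sketch that assumes the 2.x convention used here, where torch and torchaudio share the same release number and build tag; `split_version` and `versions_match` are hypothetical names.

```python
def split_version(version: str) -> tuple:
    """Split '2.2.1+cu121' into the release ('2.2.1') and build tag ('cu121')."""
    release, _, build = version.partition("+")
    return release, build


def versions_match(torch_version: str, torchaudio_version: str) -> bool:
    """True when both packages share the same release and build tag."""
    return split_version(torch_version) == split_version(torchaudio_version)


print(split_version("2.2.1+cu121"))                  # ('2.2.1', 'cu121')
print(versions_match("2.2.1+cu121", "2.2.1+cu121"))  # True
print(versions_match("2.2.1+cu121", "2.2.1+cpu"))    # False
```

In practice you would pass `torch.__version__` and `torchaudio.__version__` rather than literal strings.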
Install k2
pip install k2==1.24.4.dev20240425+cuda12.1.torch2.2.1 -f https://k2-fsa.github.io/k2/cuda.html
Verify Installation
python3 -m k2.version
Install lhotse
pip install git+https://github.com/lhotse-speech/lhotse
Verify Installation:
python3 -c "import lhotse; print(lhotse.__version__)"
Desired output:
1.24.0.dev+git.4d57d53.clean
Install icefall
git clone https://github.com/k2-fsa/icefall
cd icefall/
pip install -r ./requirements.txt
Add the directory where you cloned icefall to PYTHONPATH:
export PYTHONPATH=/speech/hasan/icefall_install/icefall:$PYTHONPATH
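To confirm that the path took effect, you can ask Python whether it can locate the packages installed so far. This is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper, and it only checks that a module can be found, not that its compiled extensions load correctly.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """True if Python can locate the module on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


for name in ("icefall", "lhotse", "k2"):
    status = "found" if is_importable(name) else "NOT found"
    print(f"{name}: {status}")
```

If `icefall` is reported as not found, PYTHONPATH most likely does not point at the cloned repository.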
cd egs/yesno/ASR/
Test your Installation
./prepare.sh
export CUDA_VISIBLE_DEVICES=""
./tdnn/train.py
./tdnn/decode.py
## Congrats!
[Reference](https://icefall.readthedocs.io/en/latest/installation/index.html)
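## Install kaldifeat
The kaldifeat wheel must be built for the torch version you installed. The command below is tagged for CPU and torch 2.3.0, so pick the matching tag from the wheel index if your torch version differs.
pip install kaldifeat==1.25.4.dev20240425+cpu.torch2.3.0 -f https://csukuangfj.github.io/kaldifeat/cpu.html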
## Install sherpa
pip install k2_sherpa==1.3.dev20240227+cpu.torch2.2.1 -f https://k2-fsa.github.io/sherpa/cpu.html
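The k2, kaldifeat, and sherpa wheels encode their build target in the version string. The sketch below splits those strings and compares the torch part with the installed release. The `+<device>.torch<version>` layout is inferred from the version strings in this guide, and `parse_wheel_tag` is a hypothetical helper.

```python
import re

WHEEL_PATTERN = re.compile(
    r"(?P<release>[^+]+)\+(?P<device>cpu|cuda[\d.]+)\.torch(?P<torch>[\d.]+)"
)


def parse_wheel_tag(version: str) -> dict:
    """Split e.g. '1.24.4.dev20240425+cuda12.1.torch2.2.1' into its parts."""
    match = WHEEL_PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"unexpected version format: {version}")
    return match.groupdict()


installed_torch = "2.2.1"  # the torch release installed earlier in this guide
wheels = {
    "k2": "1.24.4.dev20240425+cuda12.1.torch2.2.1",
    "kaldifeat": "1.25.4.dev20240425+cpu.torch2.3.0",
    "k2_sherpa": "1.3.dev20240227+cpu.torch2.2.1",
}
for name, version in wheels.items():
    tag = parse_wheel_tag(version)
    verdict = "ok" if tag["torch"] == installed_torch else "torch mismatch"
    print(f"{name}: device={tag['device']} torch={tag['torch']} -> {verdict}")
```

Running a check like this before installing makes it easier to spot a wheel built for a different torch release or for CPU when GPU decoding is intended.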
## Training
python3 egs/<dataset_name>/ASR/zipformer/train.py \
--world-size <number_of_gpus> \
--num-epochs <number_of_epochs> \
--start-epoch <starting_epoch> \
--exp-dir <experiment_directory> \
--max-duration <max_duration_per_batch> \
--num-workers <number_of_data_workers> \
--on-the-fly-feats <True_or_False> \
--manifest-dir <manifest_directory> \
--num-buckets <number_of_buckets> \
--bpe-model <path_to_bpe_model> \
--train-cuts <path_to_training_cuts> \
--valid-cuts <path_to_validation_cuts> \
--causal <1_or_0> \
--master-port <port_number>
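If you launch training from Python, the placeholders can be filled in programmatically. The sketch below builds the argument list for the command above; `build_train_command` is a hypothetical helper, and every value in `options` is an illustrative placeholder, not a recommended setting.

```python
import shlex


def build_train_command(dataset: str, options: dict) -> list:
    """Build the argv list for the Zipformer training script of one recipe."""
    argv = ["python3", f"egs/{dataset}/ASR/zipformer/train.py"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


# All values below are hypothetical placeholders.
options = {
    "world-size": 2,
    "num-epochs": 30,
    "start-epoch": 1,
    "exp-dir": "zipformer/exp",
    "max-duration": 600,
    "causal": 1,
    "master-port": 12354,
}
command = build_train_command("librispeech", options)
print(shlex.join(command))
# To launch it from the icefall root: subprocess.run(command, check=True)
```

Keeping the options in a dictionary makes it simple to log the exact configuration next to the experiment directory.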
Parameter Reference:
--world-size: Number of GPUs or processes to use for distributed training.
--num-epochs: Total number of epochs to run the training.
--start-epoch: Epoch to start training from (helpful when resuming).
--exp-dir: Path to the directory where experiment logs and model checkpoints will be saved.
--max-duration: Maximum total duration of audio in one batch, in seconds.
--num-workers: Number of workers for loading data.
--on-the-fly-feats: Whether to compute features on-the-fly during training (True or False).
--manifest-dir: Directory containing the manifest files (JSON) for training and validation data.
--num-buckets: Number of buckets used for bucketing data by sequence length.
--bpe-model: Path to the Byte-Pair Encoding model for text tokenization.
--train-cuts: Path to the JSONL file containing the training cuts.
--valid-cuts: Path to the JSONL file containing the validation cuts.
--causal: Set to 1 to train a causal model, which is required for streaming decoding; set to 0 for a non-causal model.
--master-port: Port number for distributed training communication.
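To see how --max-duration limits batch size, consider a simplified greedy grouping of utterances by total length. This is only an illustration of the idea: the real sampler also buckets cuts by length and shuffles them, and `batch_by_duration` is a hypothetical function.

```python
def batch_by_duration(durations, max_duration):
    """Greedily group durations (seconds) so each batch stays within max_duration."""
    batches, current, total = [], [], 0.0
    for duration in durations:
        # Start a new batch when adding this utterance would exceed the cap.
        if current and total + duration > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(duration)
        total += duration
    if current:
        batches.append(current)
    return batches


# Hypothetical utterance lengths in seconds.
durations = [12.0, 8.5, 15.0, 4.0, 10.0, 9.5]
for batch in batch_by_duration(durations, max_duration=30.0):
    print(batch, sum(batch))
```

A larger --max-duration packs more audio into each batch and therefore needs more GPU memory.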
## Sample decoding command
Streaming ASR Decoding with Zipformer
This script performs streaming (chunk-wise) decoding of a Zipformer model in the icefall framework. It supports greedy search decoding and lets you configure the chunk size and left context used for streaming.
./streaming_decode.py --epoch <EPOCH_NUMBER> \
--avg <AVERAGE_NUMBER> \
--exp-dir <EXPERIMENT_DIR> \
--decoding-method <DECODING_METHOD> \
--manifest-dir <MANIFEST_DIR> \
--cut-set-name <CUT_SET_NAME> \
--bpe-model <BPE_MODEL_PATH> \
--causal <CAUSAL_FLAG> \
--chunk-size <CHUNK_SIZE> \
--left-context-frames <LEFT_CONTEXT_FRAMES> \
--on-the-fly-feats <ON_THE_FLY_FEATS_FLAG> \
--use-averaged-model <AVERAGED_MODEL_FLAG> \
--num-workers <NUM_WORKERS> \
--max-duration <MAX_DURATION> \
--num-decode-streams <NUM_DECODE_STREAMS> \
--context-size <CONTEXT_SIZE>
Parameters
--epoch: Specifies which training epoch to use for decoding. A higher epoch number means the model has undergone more training.
--avg: Number of checkpoints to average. For example, --avg 4 means the 4 checkpoints ending at the one selected by --epoch are averaged for decoding.
--exp-dir: Directory where the model's experimental data, such as checkpoints and logs, are stored.
--decoding-method: Decoding strategy to be used. Common methods include greedy_search, beam_search, etc.
--manifest-dir: Directory containing manifest files for the datasets to be decoded.
--cut-set-name: Specifies which cut set to use for decoding, typically indicating the subset of data like test_1, test_2, etc.
--bpe-model: Path to the BPE model to be used for tokenization during decoding.
--causal: Indicates whether causal convolution should be used. Set 1 for causal and 0 for non-causal.
--chunk-size: The size, in frames, of each chunk processed during streaming.
--left-context-frames: Number of frames from the left context to be included during chunked decoding.
--on-the-fly-feats: If set to True, feature extraction is performed on-the-fly, without precomputing the features.
--use-averaged-model: If True, the model will use averaged parameters from multiple epochs or checkpoints.
--num-workers: Number of workers to be used for data loading during decoding.
--max-duration: The maximum duration (in seconds) of audio files to decode in one batch.
--num-decode-streams: Number of parallel decoding streams to process.
--context-size: The context size of the decoder (prediction network), i.e. how many previous tokens it sees. 1 means bigram; 2 means trigram.
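The interaction between --epoch and --avg can be made concrete with a short sketch. It assumes checkpoints are saved as `epoch-N.pt` in the experiment directory and that plain checkpoint averaging is used; with --use-averaged-model enabled the script may combine checkpoints differently. `checkpoints_to_average` is a hypothetical helper.

```python
def checkpoints_to_average(epoch: int, avg: int) -> list:
    """List the epoch checkpoints that end at `epoch` and span `avg` epochs."""
    if avg < 1 or avg > epoch:
        raise ValueError("avg must be between 1 and epoch")
    start = epoch - avg + 1
    return [f"epoch-{i}.pt" for i in range(start, epoch + 1)]


print(checkpoints_to_average(epoch=30, avg=4))
# ['epoch-27.pt', 'epoch-28.pt', 'epoch-29.pt', 'epoch-30.pt']
```

Listing the files this way is a quick check that every checkpoint needed for a given --epoch and --avg pair actually exists before decoding starts.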
Sherpa Online WebSocket Server
This command starts a WebSocket server for real-time ASR decoding using the Sherpa framework. It can decode on GPU or CPU, supports several decoding methods, and reads the model's token table from a file.
sherpa-online-websocket-server --use-gpu=<USE_GPU_FLAG> \
--tokens=<TOKENS_FILE_PATH> \
--port=<PORT_NUMBER> \
--doc-root=<DOCUMENT_ROOT> \
--nn-model=<MODEL_PATH> \
--decoding-method=<DECODING_METHOD>
Parameters
--use-gpu: Set this flag to True for GPU-based decoding, or False for CPU-based decoding.
--tokens: Path to the file containing the token list (e.g., BPE tokens) required for decoding.
--port: Port number for the WebSocket server. Ensure this port is open and not blocked by firewalls.
--doc-root: The root directory of the web resources the server serves. Files in this directory are returned when the server is accessed via a browser.
--nn-model: Path to the neural network model to be used for decoding. The model is usually a TorchScript file exported with torch.jit.script from a trained speech recognition model.
--decoding-method: The decoding strategy to use. Common methods include greedy_search, beam_search, etc. Choose based on your model and application needs.
Example
sherpa-online-websocket-server --use-gpu=True \
--tokens=/path/to/tokens.txt \
--port=8003 \
--doc-root=/path/to/web/document/root \
--nn-model=/path/to/jit_script_model.pt \
--decoding-method=greedy_search
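Once the server is running, a plain TCP connection test is a quick way to confirm that the port is reachable before involving a WebSocket client. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and 8003 is simply the port from the example command.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 8003 is the port used in the example command.
print(can_connect("127.0.0.1", 8003))
```

A result of False from another machine while the check succeeds locally usually points to a firewall rule rather than a problem with the server itself.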
Notes
GPU support: If using GPU, ensure that CUDA is properly set up on the system.
Token file: The token file should correspond to the language and tokenization scheme used when training the neural network model.
Neural Network Model: The online server expects a streaming model (for example, a causal Zipformer exported for chunk-based decoding), and the model must support the decoding method specified.
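A quick sanity check of the token file can catch a mismatch between model and tokens early. The sketch below assumes each line holds a symbol and its integer id separated by whitespace, which is the layout commonly produced by icefall recipes; verify this against your own file. `load_tokens` and the sample lines are hypothetical.

```python
def load_tokens(lines):
    """Parse 'symbol id' lines into a dict mapping id -> symbol."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        symbol, index = line.rsplit(maxsplit=1)
        table[int(index)] = symbol
    return table


# Hypothetical excerpt; a real file is the one passed via --tokens.
sample = ["<blk> 0", "<sos/eos> 1", "<unk> 2", "▁THE 3"]
tokens = load_tokens(sample)
print(len(tokens), tokens[0])  # 4 <blk>
# Ids are expected to be contiguous, starting at 0.
print(sorted(tokens) == list(range(len(tokens))))  # True
```

For a real file, pass the open file object, for example `load_tokens(open("/path/to/tokens.txt", encoding="utf-8"))`.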