poonehmousavi committed on
Commit
a27ef29
·
1 Parent(s): a03ea6b

Upload 3 files

Files changed (3)
  1. README.md +124 -1
  2. config.json +3 -0
  3. hyperparams.yaml +120 -0
README.md CHANGED
@@ -1,3 +1,126 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ language: "de"
3
+ thumbnail:
4
+ tags:
5
+ - automatic-speech-recognition
6
+ - CTC
7
+ - Attention
8
+ - pytorch
9
+ - speechbrain
10
+ license: "apache-2.0"
11
+ datasets:
12
+ - common_voice
13
+ metrics:
14
+ - wer
15
+ - cer
16
  ---
17
+
18
+ <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
19
+ <br/><br/>
20
+
21
+ # CRDNN with CTC/Attention trained on CommonVoice 14.0 German (No LM)
22
+ This repository provides all the necessary tools to perform automatic speech
23
+ recognition from an end-to-end system pretrained on CommonVoice (German Language) within
24
+ SpeechBrain. For a better experience, we encourage you to learn more about
25
+ [SpeechBrain](https://speechbrain.github.io).
26
+ The model achieves the following performance:
27
+
28
+ | Release | Test CER | Test WER | GPUs |
29
+ |:-------------:|:--------------:|:--------------:| :--------:|
30
+ | 15.08.23 | 3.82 | 12.25 | 1xV100 16GB |
31
+
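For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed from an edit distance. The function names are illustrative and are not SpeechBrain's API. The toolkit's own scoring may treat spaces or text normalization differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions, and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent: word-level edits / reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent: character-level edits / reference length."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, dropping one word from a four-word reference gives a WER of 25%.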
32
+ ## Credits
33
+ The model is provided by [vitas.ai](https://www.vitas.ai/).
34
+
35
+ ## Pipeline description
36
+ This ASR system is composed of 2 different but linked blocks:
37
+
38
+ - Tokenizer (unigram) that transforms words into subword units and is trained on
39
+ the training transcriptions (train.tsv) of CommonVoice (DE).
40
+ - Acoustic model (CRDNN + CTC/Attention). The CRDNN architecture is made of
41
+ N blocks of convolutional neural networks with normalization and pooling on the
42
+ frequency domain. Then, a bidirectional LSTM is connected to a final DNN to obtain
43
+ the final acoustic representation that is given to the CTC and attention decoders.
44
+
45
+ The system is trained with recordings sampled at 16kHz (single channel).
46
+ The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *transcribe_file* if needed.
47
+
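To illustrate the normalization step mentioned above (mono channel selection followed by resampling to 16 kHz), here is a small pure-Python sketch. It is not SpeechBrain's implementation: the helper names are hypothetical, and the resampler uses naive linear interpolation with no anti-aliasing filter, which a production resampler would normally include.

```python
def to_mono(frames):
    """Keep only the first channel of each multi-channel frame.

    `frames` is a list of tuples, one tuple of channel samples per time step.
    """
    return [frame[0] for frame in frames]


def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Stereo signal at 32 kHz reduced to mono at 16 kHz.
stereo = [(0.0, 9.0), (1.0, 9.0), (2.0, 9.0), (3.0, 9.0)]
mono_16k = resample_linear(to_mono(stereo), 32000, 16000)
```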
48
+ ## Install SpeechBrain
49
+ First of all, please install SpeechBrain with the following command:
50
+
51
+ ```bash
52
+ pip install speechbrain
53
+ ```
54
+
55
+ We also encourage you to read our tutorials and learn more about
56
+ [SpeechBrain](https://speechbrain.github.io).
57
+
58
+ ### Transcribing your own audio files (in German)
59
+
60
+ ```python
61
+ from speechbrain.pretrained import EncoderDecoderASR
62
+ asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-commonvoice-14-de", savedir="pretrained_models/asr-crdnn-commonvoice-14-de")
63
+ asr_model.transcribe_file("speechbrain/asr-crdnn-commonvoice-14-de/example-de.wav")
64
+ ```
65
+
66
+ ### Inference on GPU
67
+
68
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
69
+
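A minimal sketch of the GPU option described above. The helper names are hypothetical. The model id assumes the repository lives at `speechbrain/asr-crdnn-commonvoice-14-de`. SpeechBrain is imported lazily so the keyword-building part can be read and checked without the library installed.

```python
# Assumed repository id and cache directory (adjust to your setup).
MODEL_ID = "speechbrain/asr-crdnn-commonvoice-14-de"
SAVEDIR = "pretrained_models/asr-crdnn-commonvoice-14-de"


def from_hparams_kwargs(use_gpu):
    """Build the keyword arguments for `EncoderDecoderASR.from_hparams`."""
    kwargs = {"source": MODEL_ID, "savedir": SAVEDIR}
    if use_gpu:
        # The option named in this README: run inference on the GPU.
        kwargs["run_opts"] = {"device": "cuda"}
    return kwargs


def load_model(use_gpu=False):
    """Load the pretrained model (requires SpeechBrain and network access)."""
    from speechbrain.pretrained import EncoderDecoderASR
    return EncoderDecoderASR.from_hparams(**from_hparams_kwargs(use_gpu))
```

Calling `load_model(use_gpu=True)` on a machine without CUDA is expected to fail, so it may be worth checking `torch.cuda.is_available()` first.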
70
+ ### Parallel Inference on a Batch
71
+
72
+ See [this Colab notebook](https://colab.research.google.com/drive/1hX5ZI9S4jHIjahFCZnhwwQmFoGAi3tmu?usp=sharing) to learn how to transcribe a batch of input sentences in parallel using a pre-trained model.
73
+
74
+ ### Training
75
+
76
+ The model was trained with SpeechBrain (986a2175).
77
+ To train it from scratch, follow these steps:
78
+
79
+ 1. Clone SpeechBrain:
80
+
81
+ ```bash
82
+ git clone https://github.com/speechbrain/speechbrain/
83
+ ```
84
+
85
+ 2. Install it:
86
+
87
+ ```bash
88
+ cd speechbrain
89
+ pip install -r requirements.txt
90
+ pip install -e .
91
+ ```
92
+
93
+ 3. Run Training:
94
+
95
+ ```bash
96
+ cd recipes/CommonVoice/ASR/seq2seq
97
+ python train.py hparams/train_de.yaml --data_folder=your_data_folder
98
+ ```
99
+
100
+ You can find our training results (models, logs, etc.) [here](https://www.dropbox.com/sh/zgatirb118f79ef/AACmjh-D94nNDWcnVI4Ef5K7a?dl=0).
101
+
102
+ ### Limitations
103
+
104
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
105
+
106
+ # **About SpeechBrain**
107
+
108
+ - Website: https://speechbrain.github.io/
109
+ - Code: https://github.com/speechbrain/speechbrain/
110
+ - HuggingFace: https://huggingface.co/speechbrain/
111
+
112
+ # **Citing SpeechBrain**
113
+
114
+ Please cite SpeechBrain if you use it for your research or business.
115
+
116
+ ```bibtex
117
+ @misc{speechbrain,
118
+ title={{SpeechBrain}: A General-Purpose Speech Toolkit},
119
+ author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
120
+ year={2021},
121
+ eprint={2106.04624},
122
+ archivePrefix={arXiv},
123
+ primaryClass={eess.AS},
124
+ note={arXiv:2106.04624}
125
+ }
126
+ ```
config.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "speechbrain_interface": "EncoderDecoderASR"
3
+ }
hyperparams.yaml ADDED
@@ -0,0 +1,120 @@
1
+ # ################################
2
+ # Model: VGG2 + LSTM + time pooling
3
+ # Augmentation: SpecAugment
4
+ # Authors: Titouan Parcollet, Mirco Ravanelli, Peter Plantinga, Ju-Chieh Chou,
5
+ # and Abdel HEBA 2020
6
+ # ################################
7
+ # Feature parameters (FBANKS etc)
8
+ sample_rate: 16000
9
+ n_fft: 400
10
+ n_mels: 80
11
+ # Model parameters
12
+ activation: !name:torch.nn.LeakyReLU
13
+ dropout: 0.15
14
+ cnn_blocks: 3
15
+ cnn_channels: (128, 200, 256)
16
+ inter_layer_pooling_size: (2, 2, 2)
17
+ cnn_kernelsize: (3, 3)
18
+ time_pooling_size: 4
19
+ rnn_class: !name:speechbrain.nnet.RNN.LSTM
20
+ rnn_layers: 5
21
+ rnn_neurons: 1024
22
+ rnn_bidirectional: True
23
+ dnn_blocks: 2
24
+ dnn_neurons: 1024
25
+ emb_size: 128
26
+ dec_neurons: 1024
27
+ # Outputs
28
+ output_neurons: 500 # BPE size, index(blank/eos/bos) = 0
29
+ # Decoding parameters
30
+ # Be sure that the bos and eos index match with the BPEs ones
31
+ blank_index: 0
32
+ bos_index: 0
33
+ eos_index: 0
34
+ min_decode_ratio: 0.0
35
+ max_decode_ratio: 1.0
36
+ beam_size: 80
37
+ eos_threshold: 1.5
38
+ using_max_attn_shift: True
39
+ max_attn_shift: 140
40
+ ctc_weight_decode: 0.0
41
+ temperature: 1.50
42
+ normalizer: !new:speechbrain.processing.features.InputNormalization
43
+   norm_type: global
44
+ compute_features: !new:speechbrain.lobes.features.Fbank
45
+   sample_rate: !ref <sample_rate>
46
+   n_fft: !ref <n_fft>
47
+   n_mels: !ref <n_mels>
48
+ enc: !new:speechbrain.lobes.models.CRDNN.CRDNN
49
+   input_shape: [null, null, !ref <n_mels>]
50
+   activation: !ref <activation>
51
+   dropout: !ref <dropout>
52
+   cnn_blocks: !ref <cnn_blocks>
53
+   cnn_channels: !ref <cnn_channels>
54
+   cnn_kernelsize: !ref <cnn_kernelsize>
55
+   inter_layer_pooling_size: !ref <inter_layer_pooling_size>
56
+   time_pooling: True
57
+   using_2d_pooling: False
58
+   time_pooling_size: !ref <time_pooling_size>
59
+   rnn_class: !ref <rnn_class>
60
+   rnn_layers: !ref <rnn_layers>
61
+   rnn_neurons: !ref <rnn_neurons>
62
+   rnn_bidirectional: !ref <rnn_bidirectional>
63
+   rnn_re_init: True
64
+   dnn_blocks: !ref <dnn_blocks>
65
+   dnn_neurons: !ref <dnn_neurons>
66
+ emb: !new:speechbrain.nnet.embedding.Embedding
67
+   num_embeddings: !ref <output_neurons>
68
+   embedding_dim: !ref <emb_size>
69
+ dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
70
+   enc_dim: !ref <dnn_neurons>
71
+   input_size: !ref <emb_size>
72
+   rnn_type: gru
73
+   attn_type: location
74
+   hidden_size: 1024
75
+   attn_dim: 1024
76
+   num_layers: 1
77
+   scaling: 1.0
78
+   channels: 10
79
+   kernel_size: 100
80
+   re_init: True
81
+   dropout: !ref <dropout>
82
+ ctc_lin: !new:speechbrain.nnet.linear.Linear
83
+   input_size: !ref <dnn_neurons>
84
+   n_neurons: !ref <output_neurons>
85
+ seq_lin: !new:speechbrain.nnet.linear.Linear
86
+   input_size: !ref <dec_neurons>
87
+   n_neurons: !ref <output_neurons>
88
+ log_softmax: !new:speechbrain.nnet.activations.Softmax
89
+   apply_log: True
90
+ asr_model: !new:torch.nn.ModuleList
91
+   - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]
92
+ tokenizer: !new:sentencepiece.SentencePieceProcessor
93
+ # We compose the inference (encoder) pipeline.
94
+ encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
95
+   input_shape: [null, null, !ref <n_mels>]
96
+   compute_features: !ref <compute_features>
97
+   normalize: !ref <normalizer>
98
+   model: !ref <enc>
99
+ decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
100
+   embedding: !ref <emb>
101
+   decoder: !ref <dec>
102
+   linear: !ref <seq_lin>
103
+   bos_index: !ref <bos_index>
104
+   eos_index: !ref <eos_index>
105
+   min_decode_ratio: !ref <min_decode_ratio>
106
+   max_decode_ratio: !ref <max_decode_ratio>
107
+   beam_size: !ref <beam_size>
108
+   eos_threshold: !ref <eos_threshold>
109
+   using_max_attn_shift: !ref <using_max_attn_shift>
110
+   max_attn_shift: !ref <max_attn_shift>
111
+   temperature: !ref <temperature>
112
+ modules:
113
+   normalizer: !ref <normalizer>
114
+   encoder: !ref <encoder>
115
+   decoder: !ref <decoder>
116
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
117
+   loadables:
118
+     normalizer: !ref <normalizer>
119
+     asr: !ref <asr_model>
120
+     tokenizer: !ref <tokenizer>
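The `!ref <key>` entries in hyperparams.yaml point back to values defined earlier in the same file, so a setting such as `n_mels` is declared once and reused by the feature extractor and the encoder. The toy resolver below only illustrates that idea for whole-value references. It is not how the file is actually loaded: SpeechBrain uses the HyperPyYAML package, which also handles `!new:` and `!name:` tags and more complex references. All names in the sketch are hypothetical.

```python
import re

# Matches a value that consists solely of one reference, e.g. "!ref <n_mels>".
REF_PATTERN = re.compile(r"!ref <([^>]+)>")


def resolve_ref(value, params):
    """Return params[key] if `value` is a whole-value `!ref <key>`, else `value`."""
    if isinstance(value, str):
        match = REF_PATTERN.fullmatch(value.strip())
        if match:
            return params[match.group(1)]
    return value


# Top-level values taken from the hyperparameter file above.
params = {"n_mels": 80, "dropout": 0.15, "time_pooling_size": 4}

# A fragment of the `enc` block, with references still unresolved.
enc_block = {
    "dropout": "!ref <dropout>",
    "time_pooling": True,
    "time_pooling_size": "!ref <time_pooling_size>",
}

resolved = {key: resolve_ref(value, params) for key, value in enc_block.items()}
```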