sivan22 committed on
Commit
38bec15
·
verified ·
1 Parent(s): 1c35061

Create README.md

Files changed (1)
  1. README.md +232 -0
README.md ADDED
@@ -0,0 +1,232 @@
+ ---
+ language:
+ - en
+ - zh
+ - de
+ - es
+ - ru
+ - ko
+ - fr
+ - ja
+ - pt
+ - tr
+ - pl
+ - ca
+ - nl
+ - ar
+ - sv
+ - it
+ - id
+ - hi
+ - fi
+ - vi
+ - he
+ - uk
+ - el
+ - ms
+ - cs
+ - ro
+ - da
+ - hu
+ - ta
+ - 'no'
+ - th
+ - ur
+ - hr
+ - bg
+ - lt
+ - la
+ - mi
+ - ml
+ - cy
+ - sk
+ - te
+ - fa
+ - lv
+ - bn
+ - sr
+ - az
+ - sl
+ - kn
+ - et
+ - mk
+ - br
+ - eu
+ - is
+ - hy
+ - ne
+ - mn
+ - bs
+ - kk
+ - sq
+ - sw
+ - gl
+ - mr
+ - pa
+ - si
+ - km
+ - sn
+ - yo
+ - so
+ - af
+ - oc
+ - ka
+ - be
+ - tg
+ - sd
+ - gu
+ - am
+ - yi
+ - lo
+ - uz
+ - fo
+ - ht
+ - ps
+ - tk
+ - nn
+ - mt
+ - sa
+ - lb
+ - my
+ - bo
+ - tl
+ - mg
+ - as
+ - tt
+ - haw
+ - ln
+ - ha
+ - ba
+ - jw
+ - su
+ tags:
+ - audio
+ - automatic-speech-recognition
+ - hf-asr-leaderboard
+ widget:
+ - example_title: Librispeech sample 1
+   src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+ - example_title: Librispeech sample 2
+   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
+ pipeline_tag: automatic-speech-recognition
+ license: apache-2.0
+ datasets:
+ - ivrit-ai/whisper-training
+ ---
+
+ # NOTE: This is a CTranslate2 (faster-whisper) version of the model
+ The original model can be found [here](https://huggingface.co/ivrit-ai/whisper-large-v2-tuned).
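
Since this repository holds a CTranslate2 conversion, a minimal sketch of loading it with the `faster-whisper` package may be useful. The `model_dir` value below is a placeholder, not a real path or Hub id, and `compute_type="int8"` on CPU is only an illustrative choice; both are assumptions rather than settings documented by the authors.

```python
def transcribe_hebrew(audio_path, model_dir="path/to/this-ct2-model"):
    """Transcribe one audio file with faster-whisper.

    `model_dir` is a placeholder: point it at a local copy of the
    converted weights from this repository.
    """
    # Imported lazily so the helpers below work without the dependency.
    from faster_whisper import WhisperModel

    # Device and compute type are illustrative; adjust to your hardware.
    model = WhisperModel(model_dir, device="cpu", compute_type="int8")

    # transcribe() returns a lazy generator of segments plus metadata.
    segments, info = model.transcribe(audio_path, language="he", beam_size=5)
    return [(seg.start, seg.end, seg.text) for seg in segments]


def format_segments(segments):
    """Render (start, end, text) tuples as one line per segment."""
    return "\n".join(
        f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"
        for start, end, text in segments
    )
```

Calling `print(format_segments(transcribe_hebrew("audio.wav")))` would then print one timestamped line per segment, assuming `audio.wav` exists and `model_dir` points at the converted weights.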
+
+ # Whisper
+
+ Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
+ More details about it are available [here](https://huggingface.co/openai/whisper-large-v2).
+
+ **whisper-large-v2-tuned** is a version of whisper-large-v2, fine-tuned by [ivrit.ai](https://www.ivrit.ai) to improve Hebrew ASR using crowd-sourced labeling.
+
+ ## Model details
+
+ This model comes as a single checkpoint, whisper-large-v2-tuned.
+ It is a 1550M-parameter multilingual ASR model.
+
+ # Usage
+
+ To transcribe audio samples, the model must be used together with a [`WhisperProcessor`](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).
+
+ ```python
+ import librosa
+ import torch
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
+
+ SAMPLING_RATE = 16000
+
+ has_cuda = torch.cuda.is_available()
+ model_path = 'ivrit-ai/whisper-large-v2-tuned'
+
+ model = WhisperForConditionalGeneration.from_pretrained(model_path)
+ if has_cuda:
+     model.to('cuda:0')
+
+ processor = WhisperProcessor.from_pretrained(model_path)
+
+ # audio_resample is built from `entry`, assumed to be a row of an existing dataset.
+ # Alternatively, the audio can be loaded from an audio file.
+ audio_resample = librosa.resample(entry['audio']['array'], orig_sr=entry['audio']['sampling_rate'], target_sr=SAMPLING_RATE)
+
+ # Convert the raw waveform into the log-Mel features the model expects.
+ input_features = processor(audio_resample, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_features
+ if has_cuda:
+     input_features = input_features.to('cuda:0')
+
+ predicted_ids = model.generate(input_features, language='he', num_beams=5)
+ transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+
+ print(f'Transcript: {transcript[0]}')
+ ```
+
+ ## Evaluation
+
+ You can use the [evaluate_model.py](https://github.com/yairl/ivrit.ai/blob/master/evaluate_model.py) reference script on GitHub to evaluate the model's quality.
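
ASR quality is commonly reported as word error rate (WER). The sketch below computes WER with a plain word-level edit distance; it is an illustration of the metric only, and it is an assumption that the linked script reports WER or applies the same (here: no) text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of
    # `ref` and the first j words of `hyp`.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("שלום לכולם וברוכים הבאים", "שלום לכולם ברוכים הבאים"))  # 0.25
```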
+
+ ## Long-Form Transcription
+
+ The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
+ algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers
+ [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+ method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
+ can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:
+
+ ```python
+ >>> import torch
+ >>> from transformers import pipeline
+ >>> from datasets import load_dataset
+
+ >>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+ >>> pipe = pipeline(
+ ...     "automatic-speech-recognition",
+ ...     model="ivrit-ai/whisper-large-v2-tuned",
+ ...     chunk_length_s=30,
+ ...     device=device,
+ ... )
+
+ >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ >>> sample = ds[0]["audio"]
+
+ >>> prediction = pipe(sample.copy(), batch_size=8)["text"]
+ " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
+
+ >>> # we can also return timestamps for the predictions
+ >>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
+ [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
+   'timestamp': (0.0, 5.44)}]
+ ```
+
+ Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
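
To make the chunking idea concrete, here is a simplified sketch of how a long recording can be cut into overlapping 30 s windows. It only computes window boundaries; the function name `chunk_bounds` and the 5 s stride are illustrative assumptions, and the real pipeline additionally merges the overlapping predictions, which is not shown here.

```python
def chunk_bounds(n_samples, sampling_rate=16000, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio.

    Each chunk overlaps its neighbours by `stride_s` seconds on both sides,
    so consecutive chunks start `chunk_length_s - 2 * stride_s` seconds apart.
    """
    chunk_len = int(chunk_length_s * sampling_rate)
    stride = int(stride_s * sampling_rate)
    if chunk_len <= 2 * stride:
        raise ValueError("chunk length must exceed twice the stride")

    step = chunk_len - 2 * stride  # amount of new audio per chunk
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        bounds.append((start, end))
        if end >= n_samples:
            break
        start += step
    return bounds


# A 70 s recording at 16 kHz, reported in seconds.
sr = 16000
print([(s / sr, e / sr) for s, e in chunk_bounds(70 * sr, sampling_rate=sr)])
# [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

The overlap gives the model context on both sides of each cut, so words that straddle a boundary appear in full in at least one chunk.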
+
+
+
+ ### BibTeX entry and citation info
+
+ **ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development**
+ ```bibtex
+ @misc{marmor2023ivritai,
+   title={ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development},
+   author={Yanir Marmor and Kinneret Misgav and Yair Lifshitz},
+   year={2023},
+   eprint={2307.08720},
+   archivePrefix={arXiv},
+   primaryClass={eess.AS}
+ }
+ ```
+
+ **Whisper: Robust Speech Recognition via Large-Scale Weak Supervision**
+ ```bibtex
+ @misc{radford2022whisper,
+   doi = {10.48550/ARXIV.2212.04356},
+   url = {https://arxiv.org/abs/2212.04356},
+   author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
+   title = {Robust Speech Recognition via Large-Scale Weak Supervision},
+   publisher = {arXiv},
+   year = {2022},
+   copyright = {arXiv.org perpetual, non-exclusive license}
+ }
+ ```