huseinzol05
commited on
Update README.md
Browse files
README.md
CHANGED
@@ -16,4 +16,86 @@ WanDB at https://wandb.ai/huseinzol05/malaysian-whisper-small-v2, **still on tra
|
|
16 |
|
17 |
1. Distilled from Whisper Large V3 on Malaysian and Science context.
|
18 |
2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
|
19 |
-
3. Word level timestamp, **new task!**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
16 |
|
17 |
1. Distilled from Whisper Large V3 on Malaysian and Science context.
|
18 |
2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
|
19 |
+
3. Word level timestamp, introduced `<|transcribeprecise|>` token, **a new task!**
|
20 |
+
|
21 |
+
## how to
|
22 |
+
|
23 |
+
Load the model,
|
24 |
+
|
25 |
+
```python
|
26 |
+
import torch
|
27 |
+
from transformers.models.whisper import tokenization_whisper
|
28 |
+
|
29 |
+
tokenization_whisper.TASK_IDS = ["translate", "transcribe", 'transcribeprecise']
|
30 |
+
|
31 |
+
from transformers import WhisperForConditionalGeneration, WhisperProcessor
|
32 |
+
|
33 |
+
processor = WhisperProcessor.from_pretrained(
|
34 |
+
'mesolitica/malaysian-whisper-small-v2'
|
35 |
+
)
|
36 |
+
tokenizer = processor.tokenizer
|
37 |
+
model = WhisperForConditionalGeneration.from_pretrained(
|
38 |
+
'mesolitica/malaysian-whisper-small-v2', torch_dtype = torch.bfloat16
|
39 |
+
).cuda().eval()
|
40 |
+
```
|
41 |
+
|
42 |
+
### Transcribe
|
43 |
+
|
44 |
+
```python
|
45 |
+
from datasets import Audio
|
46 |
+
import requests
|
47 |
+
|
48 |
+
sr = 16000
|
49 |
+
audio = Audio(sampling_rate=sr)
|
50 |
+
|
51 |
+
r = requests.get('https://github.com/mesolitica/malaya-speech/raw/master/speech/assembly.mp3')
|
52 |
+
y = audio.decode_example(audio.encode_example(r.content))['array']
|
53 |
+
|
54 |
+
with torch.no_grad():
|
55 |
+
p = processor([y], return_tensors='pt')
|
56 |
+
p['input_features'] = p['input_features'].to(torch.bfloat16)
|
57 |
+
r = model.generate(
|
58 |
+
p['input_features'].cuda(),
|
59 |
+
output_scores=True,
|
60 |
+
return_dict_in_generate=True,
|
61 |
+
language='ms',
|
62 |
+
return_timestamps=True, task = 'transcribe')
|
63 |
+
|
64 |
+
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0]))
|
65 |
+
```
|
66 |
+
|
67 |
+
```
|
68 |
+
<|startoftranscript|><|ms|><|transcribe|><|0.02|> Assembly on Aging di Vienna, Australia<|3.78|><|3.78|> yang telah diadakan pada tahun 1982<|6.50|><|6.50|> dan berasaskan unjuran tersebut<|8.82|><|8.82|> maka Jabatan Perangkaan Malaysia<|10.40|><|10.40|> menganggarkan menjelang tahun 2035<|13.72|><|13.72|> sejumlah 15% penduduk kita adalah daripada kalangan warga emas.<|18.72|><|19.28|> Untuk makluman Tuan Yang Pertua dan juga Alia Mbahumat,<|22.12|><|22.26|> pembangunan sistem pendaftaran warga emas<|24.02|><|24.02|> ataupun kita sebutkan event<|25.38|><|25.38|> adalah usaha kerajaan ke arah merealisasikan<|28.40|><|endoftext|>
|
69 |
+
```
|
70 |
+
|
71 |
+
### Transcribe word level timestamp
|
72 |
+
|
73 |
+
You must use `transcribeprecise` for the task, or `<|transcribeprecise|>` token,
|
74 |
+
|
75 |
+
```python
|
76 |
+
from datasets import Audio
|
77 |
+
import requests
|
78 |
+
|
79 |
+
sr = 16000
|
80 |
+
audio = Audio(sampling_rate=sr)
|
81 |
+
|
82 |
+
r = requests.get('https://github.com/mesolitica/malaya-speech/raw/master/speech/assembly.mp3')
|
83 |
+
y = audio.decode_example(audio.encode_example(r.content))['array']
|
84 |
+
|
85 |
+
with torch.no_grad():
|
86 |
+
p = processor([y], return_tensors='pt')
|
87 |
+
p['input_features'] = p['input_features'].to(torch.bfloat16)
|
88 |
+
r = model.generate(
|
89 |
+
p['input_features'].cuda(),
|
90 |
+
output_scores=True,
|
91 |
+
return_dict_in_generate=True,
|
92 |
+
language='ms',
|
93 |
+
return_timestamps=True, task = 'transcribeprecise')
|
94 |
+
|
95 |
+
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0]))
|
96 |
+
```
|
97 |
+
|
98 |
+
```
|
99 |
+
<|startoftranscript|><|ms|><|transcribeprecise|><|0.02|> Assembly<|1.20|><|1.56|> on<|1.64|><|1.74|> Aging<|2.04|><|2.14|> di<|2.22|><|2.26|> Vienna<|2.50|><|2.72|> Australia<|3.12|><|4.26|> yang<|4.38|><|4.42|> telah<|4.58|><|4.62|> diadakan<|5.08|><|5.16|> pada<|5.30|><|5.36|> tahun<|5.60|><|5.62|> 1982<|6.92|><|7.12|> dan<|7.24|><|7.32|> berasaskan<|7.88|><|7.98|> unjuran<|8.36|><|8.42|> tersebut<|8.80|><|8.88|> maka<|9.06|><|9.12|> Jabatan<|9.48|><|9.56|> Perangkaan<|9.98|><|10.04|> Malaysia<|10.36|><|10.84|> menganggarkan<|11.56|><|11.98|> menjelang<|12.34|><|12.40|> tahun<|12.64|><|12.66|> 2035<|14.08|><|14.50|> sejumlah<|14.96|><|14.98|> 15%<|16.14|><|16.26|> penduduk<|16.62|><|16.68|> kita<|16.90|><|17.02|> adalah<|17.30|><|17.40|> daripada<|17.80|><|17.86|> kalangan<|18.16|><|18.22|> warga<|18.40|><|18.46|> emas.<|18.68|><|19.24|> Untuk<|19.40|><|19.46|> makluman<|19.86|><|20.64|> Tuan<|20.76|><|20.82|> Yang<|20.90|><|20.94|> Pertua<|21.14|><|21.20|> dan<|21.28|><|21.34|> juga<|21.50|><|21.58|> Alia<|21.70|><|21.76|> Mbah<|21.88|><|21.92|> Ahmad,<|22.08|><|22.22|> pembangunan<|22.66|><|22.72|> sistem<|23.00|><|23.06|> pendaftaran<|23.48|><|23.54|> warga<|23.72|><|23.78|> emas<|23.98|><|24.06|> ataupun<|24.36|><|24.42|> kita<|24.56|><|24.62|> sebutkan<|24.94|><|25.08|> event<|25.38|><|25.86|> adalah<|26.10|><|26.18|> usaha<|26.46|><|26.60|> kerajaan<|27.06|><|27.16|> kearah<|27.44|><|27.50|> merealisasikan<|28.36|><|28.86|> objektif<|29.36|><|29.42|> yang<|29.52|><|29.56|> telah<|29.72|><|29.76|> digarakan<|30.00|><|endoftext|>
|
100 |
+
```
|
101 |
+
|