---
library_name: transformers
tags: []
---

# Model Card for wav2vec2-large-xlsr-persian-fine-tuned

## Model Details

### Model Description

This model is a fine-tuned version of `facebook/wav2vec2-large-xlsr-53` on Persian language data from the Mozilla Common Voice Dataset. The model is fine-tuned for automatic speech recognition (ASR) tasks.

- **Developed by:** Alireza Dastmalchi Saei
- **Funded by:** -
- **Shared by:** -
- **Model type:** wav2vec2
- **Language(s) (NLP):** Persian
- **License:** MIT
- **Finetuned from model:** wav2vec2-large-xlsr-53

### Model Sources

- **Repository:** [Model Repository](https://huggingface.co/AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned)
- **Paper:** -
- **Demo:** -

## Uses

### Direct Use

This model can be used directly for transcribing Persian speech to text, but it should be fine-tuned further on additional data before being relied on for accurate transcriptions.

### Downstream Use

The model can be fine-tuned further for specific ASR tasks or integrated into larger speech-processing pipelines.

### Out-of-Scope Use

The model is not suitable for languages other than Persian and may not perform well on noisy audio or speech with heavy accents not represented in the training data.

## Bias, Risks, and Limitations

The model is trained on a dataset that may not cover all variations of the Persian language, leading to potential biases in recognizing less represented dialects or accents.

### Recommendations

Users should be aware of the biases, risks, and limitations. Further fine-tuning on diverse datasets is recommended to mitigate these biases.

## How to Get Started with the Model

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import torchaudio

# Load processor and model
processor = Wav2Vec2Processor.from_pretrained("AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned")
model = Wav2Vec2ForCTC.from_pretrained("AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned")
model.eval()

# Load the audio file and resample to the 16 kHz rate the model expects
audio_input, sample_rate = torchaudio.load("path_to_audio.wav")
if sample_rate != 16_000:
    audio_input = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)(audio_input)
speech = audio_input.mean(dim=0).numpy()  # convert to mono, shape (num_samples,)

# Preprocess and predict
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Transcription:", transcription[0])
```

## Training Details

### Training Data

The model is fine-tuned on the Persian portion of the Mozilla Common Voice dataset. Samples are filtered by duration: clips between 4 and 6 seconds are used for training, and clips up to 15 seconds for testing (a rough sketch of this filtering is shown below).
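
As an illustration only (not taken from the actual training code), the sketch below shows how such duration filtering could be done with the `datasets` library. The exact Common Voice release and configuration name are assumptions.

```python
from datasets import load_dataset

# Hypothetical: the Common Voice release used for fine-tuning is not specified
# in this card, so "mozilla-foundation/common_voice_13_0" is an assumption.
train_ds = load_dataset("mozilla-foundation/common_voice_13_0", "fa", split="train")
test_ds = load_dataset("mozilla-foundation/common_voice_13_0", "fa", split="test")

def duration_seconds(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]

# Keep 4-6 second clips for training and clips up to 15 seconds for testing
train_ds = train_ds.filter(lambda ex: 4.0 <= duration_seconds(ex) <= 6.0)
test_ds = test_ds.filter(lambda ex: duration_seconds(ex) <= 15.0)
```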

### Training Procedure

The audio is resampled from 48000 Hz to 16000 Hz. The tokenizer, feature extractor, and processor are defined using the `Wav2Vec2CTCTokenizer`, `Wav2Vec2FeatureExtractor`, and `Wav2Vec2Processor` classes.
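
A minimal sketch of this setup is shown below; the vocabulary file name and special tokens are assumptions and would come from the character vocabulary built over the Persian transcripts.

```python
import torchaudio
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# Resample Common Voice audio from 48 kHz to the 16 kHz rate expected by wav2vec2
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

# "vocab.json" is assumed to hold the character vocabulary extracted from the Persian transcripts
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16_000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```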

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision
- **Batch Size:** 12
- **Num Epochs:** 5
- **Learning Rate:** 1e-4
- **Gradient Accumulation Steps:** 2
- **Warmup Steps:** 1000
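
A minimal sketch of how these hyperparameters could be passed to the `Trainer` API is shown below; the output directory and any options not listed above (logging, evaluation strategy, etc.) are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-large-xlsr-persian-fine-tuned",  # assumed output path
    per_device_train_batch_size=12,
    gradient_accumulation_steps=2,
    num_train_epochs=5,
    learning_rate=1e-4,
    warmup_steps=1000,
    fp16=True,  # fp16 mixed-precision training
)
```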

### Speeds, Sizes, Times

- **Training Files:** 2217
- **Testing Files:** 5212
- **Training Time (minutes):** 19.67
- **Total Parameters:** 315,479,720
- **Trainable Parameters:** 311,269,544
- **WER:** 1.0

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model is evaluated on a subset of the Mozilla Common Voice Dataset.

#### Factors

Evaluation is disaggregated by audio sample length.

#### Metrics

Word Error Rate (WER) is used as the evaluation metric. It measures the proportion of words that are incorrectly predicted (substituted, inserted, or deleted) relative to the reference transcription.
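
For reference, WER can be computed with the `evaluate` library as in the example below; the strings are placeholders rather than actual model outputs.

```python
import evaluate

wer_metric = evaluate.load("wer")

# Placeholder strings, not real predictions from this model
predictions = ["سلام دنیا"]
references = ["سلام دنیا عزیز"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")  # 0.0 = perfect, 1.0 = every reference word wrong
```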

### Results

The model achieves a WER of 1.0 (i.e., 100% word error rate) on the test data, indicating that further fine-tuning is needed before the model produces usable transcriptions.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Colab T4 GPU

## Technical Specifications

### Model Architecture and Objective

The model uses the Wav2Vec2 architecture, which is designed for automatic speech recognition.

### Compute Infrastructure

#### Hardware

Colab T4 GPU

#### Software

Python Notebook (.ipynb)

## Model Card Contact

For further information, contact the model author (Alireza Dastmalchi Saei) through the Hugging Face model repository linked above.