|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Model Card for wav2vec2-large-xlsr-persian-fine-tuned |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a fine-tuned version of `facebook/wav2vec2-large-xlsr-53` on Persian language data from the Mozilla Common Voice Dataset. The model is fine-tuned for automatic speech recognition (ASR) tasks. |
|
|
|
- **Developed by:** Alireza Dastmalchi Saei |
|
- **Funded by:** - |
|
- **Shared by:** - |
|
- **Model type:** wav2vec2 |
|
- **Language(s) (NLP):** Persian |
|
- **License:** MIT |
|
- **Finetuned from model:** facebook/wav2vec2-large-xlsr-53
|
|
|
### Model Sources |
|
|
|
- **Repository:** [Model Repository](https://huggingface.co/AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned) |
|
- **Paper:** - |
|
- **Demo:** - |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model can be used directly for transcribing Persian speech to text, although further fine-tuning on domain-specific data may improve accuracy.
|
|
|
### Downstream Use |
|
|
|
The model can be fine-tuned further for specific ASR tasks or integrated into larger speech-processing pipelines. |
|
|
|
### Out-of-Scope Use |
|
|
|
The model is not suitable for languages other than Persian and may not perform well on noisy audio or speech with heavy accents not represented in the training data. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The model is trained on a dataset that may not cover all variations of the Persian language, leading to potential biases in recognizing less represented dialects or accents. |
|
|
|
### Recommendations |
|
|
|
Users should be aware of these biases, risks, and limitations. Further fine-tuning on more diverse Persian speech data is recommended to mitigate them.
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import torchaudio

# Load processor and model
processor = Wav2Vec2Processor.from_pretrained("AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned")
model = Wav2Vec2ForCTC.from_pretrained("AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned")

# Load the audio file and resample to the 16 kHz rate the model expects
audio_input, sample_rate = torchaudio.load("path_to_audio.wav")
if sample_rate != 16000:
    audio_input = torchaudio.functional.resample(audio_input, sample_rate, 16000)

# Collapse multi-channel audio to a mono 1-D waveform
speech = audio_input.mean(dim=0)

# Preprocess and predict
inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print("Transcription:", transcription)
```
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model is fine-tuned on Persian speech from the Mozilla Common Voice Dataset. Samples are filtered by duration: between 4 and 6 seconds for training, and up to 15 seconds for testing, as sketched below.
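Below is a minimal sketch of this duration filtering using the Hugging Face `datasets` library. The dataset identifier, split, and column names are assumptions; the card does not state which Common Voice release was used.

```python
from datasets import load_dataset, Audio

# Hypothetical Common Voice release; the exact version used for fine-tuning is not stated
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "fa", split="train")

# Decode audio at 16 kHz so durations are computed on the resampled waveform
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def duration_seconds(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]

# Keep training clips between 4 and 6 seconds, per the filtering described above
train_data = dataset.filter(lambda ex: 4.0 <= duration_seconds(ex) <= 6.0)
```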
|
|
|
### Training Procedure |
|
|
|
The audio is resampled from 48000 Hz to 16000 Hz. The tokenizer, feature extractor, and processor are defined using the `Wav2Vec2CTCTokenizer`, `Wav2Vec2FeatureExtractor`, and `Wav2Vec2Processor` classes. |
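As a minimal sketch, the three components can be assembled as follows. The vocabulary file name and special tokens are assumptions based on common wav2vec2 fine-tuning setups, not details confirmed by this card.

```python
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# Character-level CTC tokenizer over a vocabulary built from the training transcripts
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",  # assumed vocabulary file name
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)

# Feature extractor matching the 16 kHz input expected by XLSR-53
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)

# Processor bundles the feature extractor and tokenizer into a single interface
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```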
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** fp16 mixed precision |
|
- **Batch Size:** 12 |
|
- **Num Epochs:** 5 |
|
- **Learning Rate:** 1e-4 |
|
- **Gradient Accumulation Steps:** 2 |
|
- **Warmup Steps:** 1000 |
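A minimal sketch of how these hyperparameters could map onto `TrainingArguments`; the output directory and any settings not listed above are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-persian-fine-tuned",  # assumed path
    per_device_train_batch_size=12,
    gradient_accumulation_steps=2,
    num_train_epochs=5,
    learning_rate=1e-4,
    warmup_steps=1000,
    fp16=True,  # fp16 mixed-precision training, as listed above
)
```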
|
|
|
### Speeds, Sizes, Times |
|
|
|
- **Training Files:** 2217 |
|
- **Testing Files:** 5212 |
|
- **Training Time (minutes):** 19.67 |
|
- **Total Parameters:** 315,479,720 |
|
- **Trainable Parameters:** 311,269,544 |
|
- **WER:** 1.0 |
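The gap between total and trainable parameters is consistent with freezing the convolutional feature encoder before fine-tuning, a common practice when adapting XLSR checkpoints; this card does not confirm that step, so treat it as an assumption. A minimal sketch for reproducing the counts:

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned")
model.freeze_feature_encoder()  # assumed: freezes the CNN feature encoder's parameters

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total:,}")
print(f"Trainable parameters: {trainable:,}")
```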
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
The model is evaluated on a subset of the Mozilla Common Voice Dataset. |
|
|
|
#### Factors |
|
|
|
Evaluation is disaggregated by audio sample length.
|
|
|
#### Metrics |
|
|
|
Word Error Rate (WER) is used as the evaluation metric. It is the ratio of word-level errors (substitutions, deletions, and insertions) to the number of words in the reference transcript; lower is better.
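As a minimal sketch, WER can be computed with the Hugging Face `evaluate` library; the strings below are illustrative placeholders, not actual model output.

```python
import evaluate

wer_metric = evaluate.load("wer")

# Illustrative placeholder strings, not actual predictions from this model
predictions = ["transcribed hypothesis text"]
references = ["reference transcript text"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")
```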
|
|
|
### Results |
|
|
|
The model achieves a WER of 1.0 on the test data. |
|
|
|
## Environmental Impact |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** Colab T4 GPU |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
The model uses the Wav2Vec2 architecture with a CTC (connectionist temporal classification) head, mapping raw audio to character sequences for automatic speech recognition.
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
|
|
Colab T4 GPU |
|
|
|
#### Software |
|
|
|
Python notebook (`.ipynb`) using PyTorch, torchaudio, and Hugging Face `transformers`.
|
|
|
## Model Card Contact |
|
|
|
For further information, contact the author through the [model repository](https://huggingface.co/AlirezaSaei/wav2vec2-large-xlsr-persian-fine-tuned).