---
license: apache-2.0
language:
- ug
base_model:
- lucio/xls-r-uyghur-cv7
pipeline_tag: automatic-speech-recognition
---
# Model Card for xls-r-uyghur-cv18
<!-- Provide a quick summary of what the model is/does. -->
This model is a fine-tuned version of lucio/xls-r-uyghur-cv7 (itself based on facebook/wav2vec2-xls-r-300m), fine-tuned on the Uyghur (ug) subset of the MOZILLA-FOUNDATION/COMMON_VOICE_18_0 dataset.
It achieves the following result on the evaluation set:
- Loss: 1.0882
## Model Details
For details of the underlying architecture, see facebook/wav2vec2-xls-r-300m.
### Model Description
The model vocabulary consists of the alphabetic characters of the Perso-Arabic script for the Uyghur language, with punctuation removed.
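As a quick check, the character set can be inspected from the processor published with this model (a minimal sketch; it assumes the processor files in this repository, mamatjan/xls-r-uyghur-cv18, can be downloaded):
```
from transformers import Wav2Vec2Processor

# Load the processor published with this model
processor = Wav2Vec2Processor.from_pretrained("mamatjan/xls-r-uyghur-cv18")

# The CTC vocabulary: Perso-Arabic characters plus special tokens such as <pad> and the word delimiter
vocab = processor.tokenizer.get_vocab()
print(len(vocab))
print(sorted(vocab, key=vocab.get))
```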
### Intended uses & limitations
This model is expected to be of some utility for low-fidelity use cases such as:
- Draft video captions
- Indexing of recorded broadcasts
The model is not reliable enough to use as a substitute for live captions for accessibility purposes, and it should not be used in a manner that would infringe the privacy of any of the contributors to the Common Voice dataset nor any other speakers.
### Training and evaluation data
The combination of the official Common Voice train and dev splits was used as training data.
The official test split was used for final evaluation.
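For reference, the combined training split could be built roughly as follows (a sketch only; it assumes access to the gated mozilla-foundation/common_voice_18_0 dataset, which requires accepting its terms and logging in to the Hub, and depending on your `datasets` version may also need `trust_remote_code=True`):
```
from datasets import load_dataset, concatenate_datasets

# Uyghur ("ug") configuration of Common Voice 18.0
train = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dev = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="validation")
test = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")

# Train on train + dev; keep the official test split for final evaluation
train_data = concatenate_datasets([train, dev])
```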
### Training procedure
The featurization layers of the XLS-R model are frozen while tuning a final CTC/LM layer on the Uyghur CV18 example sentences.
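A quick way to see how much of the model stays trainable after freezing the feature encoder (as done in the training script below) is to count parameters; this is a sketch, assuming the same base checkpoint used by that script:
```
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("lucio/xls-r-uyghur-cv7")
model.freeze_feature_encoder()  # freeze the convolutional featurization layers

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```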
### Training hyperparameters
The following hyperparameters were used during training:
- group_by_length: True
- per_device_train_batch_size: 8
- eval_strategy: "no"
- num_train_epochs: 3
- fp16: True
- save_steps: 500
- eval_steps: 500
- logging_steps: 500
- learning_rate: 1e-4
- warmup_steps: 500
- save_total_limit: 2
### How to train
Create a Python file named `fine_tune.py` with the following contents:
```
import torchaudio
import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import DefaultDataCollator
from transformers import TrainingArguments, Trainer
from dataclasses import dataclass
from typing import Dict, List, Union
import librosa
# Load the dataset (the Uyghur configuration of Common Voice 18.0)
dataset = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dataset = dataset.cast_column("path", Audio())

# Load the processor
processor = Wav2Vec2Processor.from_pretrained("lucio/xls-r-uyghur-cv7")
def preprocess_function(batch):
    audio = batch["path"]
    if audio["sampling_rate"] != 16000:
        resampler = torchaudio.transforms.Resample(audio["sampling_rate"], 16000)
        waveform = torch.tensor(audio["array"], dtype=torch.float32)
        audio["array"] = resampler(waveform).numpy()
    # Make sure all audio clips have the same length
    audio_array = librosa.util.fix_length(audio["array"], size=200000)
    # Convert the audio array to a tensor
    audio_tensor = torch.from_numpy(audio_array).float()
    inputs = processor(
        audio_tensor,
        sampling_rate=16000,
        return_tensors="pt",
        padding="longest"
    )
    with processor.as_target_processor():
        labels = processor(batch["sentence"]).input_ids
    batch["input_values"] = inputs.input_values[0]  # remove the batch dimension
    batch["labels"] = labels
    return batch
# Apply the preprocessing
dataset = dataset.map(preprocess_function, remove_columns=["path", "sentence"])

model = Wav2Vec2ForCTC.from_pretrained("lucio/xls-r-uyghur-cv7", ctc_loss_reduction="mean", pad_token_id=processor.tokenizer.pad_token_id)

# Freeze the feature extractor parameters
model.freeze_feature_encoder()
training_args = TrainingArguments(
    output_dir="./wav2vec2_finetune",
    group_by_length=True,
    per_device_train_batch_size=8,
    eval_strategy="no",
    num_train_epochs=3,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)
@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Extract all input_values and convert them to tensors
        input_features = [torch.tensor(feature["input_values"]) for feature in features]
        # Find the shortest sequence length in the batch
        min_length = min(map(len, input_features))
        # Truncate input_values to that length
        input_features = [feature[:min_length] for feature in input_features]
        # Pad input_values
        input_features = torch.nn.utils.rnn.pad_sequence(input_features, batch_first=True)
        # Collect all label sequences and convert them to tensors
        label_features = [torch.tensor(feature["labels"]) for feature in features]
        # Pad the labels (positions with -100 are ignored by the loss)
        labels_batch = torch.nn.utils.rnn.pad_sequence(label_features, batch_first=True, padding_value=-100)
        batch = {
            "input_values": input_features,
            "labels": labels_batch,
        }
        return batch

# Use the custom data collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor.feature_extractor,
    data_collator=data_collator
)

trainer.train()

model.save_pretrained("fine_tuned_wav2vec2_UGASR_model")  # name of the fine-tuned model
processor.save_pretrained("fine_tuned_wav2vec2_UGASR_model")  # fine-tuning is complete; the saved "fine_tuned_wav2vec2_UGASR_model" can now be evaluated further
```
The above is the full content of `fine_tune.py`.
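To evaluate the saved checkpoint on the official test split, a word error rate (WER) check along the following lines can be used (a sketch only; it assumes the `evaluate` library is installed and that the gated Common Voice 18.0 dataset is accessible):
```
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
processor = Wav2Vec2Processor.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
model.eval()
wer = evaluate.load("wer")

test = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16000))

predictions, references = [], []
for sample in test:
    inputs = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.batch_decode(ids)[0])
    references.append(sample["sentence"])

print("WER:", wer.compute(predictions=predictions, references=references))
```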
- **Developed by:** Mamajtan Abudkader 2024.9.10
- **Model type:** ASR
- **Language(s) (NLP):** Uyghur
- **License:** Apache 2.0
- **Finetuned from model:** lucio/xls-r-uyghur-cv7
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This model performs automatic speech recognition for the Uyghur language written in the Perso-Arabic script.
## How to Get Started with the Model
Use the code below to get started with the model. Create a Python file named `asr.py` with the following contents:
```
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import torch
import time
stt = time.time()
# Path (or Hub ID) of the model
model_path = "mamatjan/xls-r-uyghur-cv18"

# Load the model and processor
model = Wav2Vec2ForCTC.from_pretrained(model_path)
processor = Wav2Vec2Processor.from_pretrained(model_path)

# Read the audio file and resample it to 16 kHz.
# "example.mp3" is the audio file to transcribe; keep it in the same directory as asr.py or give its full path.
audio_input, sampling_rate = librosa.load("example.mp3", sr=None)
if sampling_rate != 16000:
    audio_input = librosa.resample(audio_input, orig_sr=sampling_rate, target_sr=16000)
    sampling_rate = 16000

# Process the audio data with the processor
inputs = processor(audio_input, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# Run the model
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode the predictions with the CTC decoder
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

waqit = time.time() - stt
print("======سەرىپ قىلغان ۋاقىت===============")  # print the elapsed time
print(f"ۋاقىت: {waqit:.2f} سىكۇنت")  # prints "time: *.** seconds"
print(transcription[0])  # print the transcribed Uyghur text; this is the end of asr.py
```
The above is the full content of `asr.py`.
## Hardware
An NVIDIA GeForce RTX 3060 Ti was used for training on a Windows 10 system; training ran for 14 hours.