|
--- |
|
license: apache-2.0 |
|
language: |
|
- ug |
|
base_model: |
|
- lucio/xls-r-uyghur-cv7
|
pipeline_tag: automatic-speech-recognition |
|
--- |
|
# Model Card for xls-r-uyghur-cv18
|
|
|
|
|
|
This model is a fine-tuned version of lucio/xls-r-uyghur-cv7, which is itself based on facebook/wav2vec2-xls-r-300m. The Uyghur (`ug`) subset of the mozilla-foundation/common_voice_18_0 dataset was used for fine-tuning.
|
|
|
It achieves the following result:

- Loss: 1.0882
|
## Model Details |
|
For details of the model architecture, see facebook/wav2vec2-xls-r-300m.
|
|
|
### Model Description |
|
The model vocabulary consists of the alphabetic characters of the Perso-Arabic script for the Uyghur language, with punctuation removed. |
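
To see exactly which characters are covered, you can print the tokenizer's vocabulary (a minimal sketch using the base checkpoint named in this card):

```
from transformers import Wav2Vec2Processor

# Print the CTC vocabulary; it should contain the Perso-Arabic letters used
# for Uyghur plus a handful of special tokens (pad, unk, word delimiter).
processor = Wav2Vec2Processor.from_pretrained("lucio/xls-r-uyghur-cv7")
print(sorted(processor.tokenizer.get_vocab().keys()))
```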
|
|
|
### Intended uses & limitations |
|
This model is expected to be of some utility for low-fidelity use cases such as:

- Drafting video captions
- Indexing recorded broadcasts
|
|
|
The model is not reliable enough to use as a substitute for live captions for accessibility purposes, and it should not be used in a manner that would infringe the privacy of the contributors to the Common Voice dataset or of any other speakers.
|
|
|
### Training and evaluation data |
|
The combination of the official Common Voice train and dev splits was used as training data.
|
|
|
The official test split was used for final evaluation. |
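
In `datasets` syntax, these splits can be loaded as follows (a minimal sketch; the Common Voice dataset is gated, so access must be requested on the Hub first):

```
from datasets import load_dataset

# Combined train+dev for training; the official test split for final evaluation.
train_data = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train+validation")
test_data = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")
```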
|
|
|
### Training procedure |
|
The convolutional feature-extraction layers of the XLS-R model are frozen, while the rest of the network, including the final CTC layer, is fine-tuned on the Uyghur Common Voice 18 sentences.
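
The freezing step can be verified by counting trainable parameters (a minimal sketch using the base checkpoint named in this card):

```
from transformers import Wav2Vec2ForCTC

# After freeze_feature_encoder(), the convolutional feature extractor no
# longer receives gradients; only the transformer and CTC head are tuned.
model = Wav2Vec2ForCTC.from_pretrained("lucio/xls-r-uyghur-cv7")
model.freeze_feature_encoder()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
```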
|
|
|
|
|
### Training hyperparameters |
|
The following hyperparameters were used during training: |
|
|
|
```
group_by_length=True,
per_device_train_batch_size=8,
eval_strategy="no",
num_train_epochs=3,
fp16=True,
save_steps=500,
eval_steps=500,
logging_steps=500,
learning_rate=1e-4,
warmup_steps=500,
save_total_limit=2
```
|
|
|
|
### How to train

Create a Python file named "fine_tune.py" with the following contents:
|
|
|
```
import torchaudio
import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import TrainingArguments, Trainer
from dataclasses import dataclass
from typing import Dict, List, Union
import librosa

# Load the Uyghur split of Common Voice 18 (train + dev); the dataset is
# gated, so accept its terms on the Hub and log in first
dataset = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train+validation")
dataset = dataset.cast_column("path", Audio())

# Load the processor
processor = Wav2Vec2Processor.from_pretrained("lucio/xls-r-uyghur-cv7")

def preprocess_function(batch):
    audio = batch["path"]

    # Resample to 16 kHz if necessary
    if audio["sampling_rate"] != 16000:
        resampler = torchaudio.transforms.Resample(audio["sampling_rate"], 16000)
        waveform = torch.tensor(audio["array"], dtype=torch.float32)
        audio["array"] = resampler(waveform).numpy()

    # Pad or truncate every clip to the same length (200,000 samples = 12.5 s at 16 kHz)
    audio_array = librosa.util.fix_length(audio["array"], size=200000)

    # Convert the audio array to a tensor
    audio_tensor = torch.from_numpy(audio_array).float()

    inputs = processor(
        audio_tensor,
        sampling_rate=16000,
        return_tensors="pt",
        padding="longest"
    )

    # Tokenize the transcript into label ids
    labels = processor.tokenizer(batch["sentence"]).input_ids

    batch["input_values"] = inputs.input_values[0]  # drop the batch dimension
    batch["labels"] = labels
    return batch


# Apply the preprocessing
dataset = dataset.map(preprocess_function, remove_columns=["path", "sentence"])

model = Wav2Vec2ForCTC.from_pretrained(
    "lucio/xls-r-uyghur-cv7",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id
)

# Freeze the convolutional feature encoder
model.freeze_feature_encoder()

training_args = TrainingArguments(
    output_dir="./wav2vec2_finetune",
    group_by_length=True,
    per_device_train_batch_size=8,
    eval_strategy="no",
    num_train_epochs=3,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Collect the input_values of each example and convert them to tensors
        input_features = [torch.tensor(feature["input_values"]) for feature in features]

        # Truncate to the shortest sequence in the batch (a no-op here,
        # since fix_length already made all clips the same length)
        min_length = min(map(len, input_features))
        input_features = [feature[:min_length] for feature in input_features]

        # Pad input_values into a single batch tensor
        input_features = torch.nn.utils.rnn.pad_sequence(input_features, batch_first=True)

        # Collect the label sequences and convert them to tensors
        label_features = [torch.tensor(feature["labels"]) for feature in features]

        # Pad labels with -100 so padding is ignored by the CTC loss
        labels_batch = torch.nn.utils.rnn.pad_sequence(label_features, batch_first=True, padding_value=-100)

        batch = {
            "input_values": input_features,
            "labels": labels_batch,
        }
        return batch

# Use the custom data collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor.feature_extractor,
    data_collator=data_collator
)

trainer.train()

# Save the fine-tuned model and processor; fine-tuning is now complete, and the
# saved "fine_tuned_wav2vec2_UGASR_model" can be evaluated further
model.save_pretrained("fine_tuned_wav2vec2_UGASR_model")
processor.save_pretrained("fine_tuned_wav2vec2_UGASR_model")
```
|
|
|
The above is the full content of "fine_tune.py". Run it with `python fine_tune.py` to start training.
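
Once training finishes, the saved checkpoint can be reloaded for a quick smoke test (a minimal sketch; the directory name comes from the script above):

```
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Reload the fine-tuned model and processor from the output directory
# written by fine_tune.py.
model = Wav2Vec2ForCTC.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
processor = Wav2Vec2Processor.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
print(model.config.vocab_size, "tokens in the CTC vocabulary")
```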
|
|
|
- **Developed by:** Mamajtan Abudkader (2024-09-10)
|
- **Model type:** ASR |
|
- **Language(s) (NLP):** Uyghur |
|
- **License:** Apache 2.0
|
- **Finetuned from model:** lucio/xls-r-uyghur-cv7 |
|
|
|
|
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
This model performs automatic speech recognition of the Uyghur language written in the Perso-Arabic script.
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. Create a Python file named "asr.py" with the following contents:
|
```
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import torch
import time

stt = time.time()

# Hub id (or local path) of the model
model_path = "mamatjan/xls-r-uyghur-cv18"

# Load the model and processor
model = Wav2Vec2ForCTC.from_pretrained(model_path)
processor = Wav2Vec2Processor.from_pretrained(model_path)

# Read the audio file and resample it to 16 kHz. "example.mp3" is the file to
# transcribe; keep it in the same directory as asr.py or give its full path.
audio_input, sampling_rate = librosa.load("example.mp3", sr=None)
if sampling_rate != 16000:
    audio_input = librosa.resample(audio_input, orig_sr=sampling_rate, target_sr=16000)
    sampling_rate = 16000

# Process the audio data with the processor
inputs = processor(audio_input, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# Run the model
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding of the predictions
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

waqit = time.time() - stt
print("======سەرىپ قىلغان ۋاقىت===============")  # header: "time spent"
print(f"ۋاقىت: {waqit:.2f} سىكۇنت")  # prints "time: *.** seconds"
print(transcription[0])  # print the transcribed Uyghur text; this is the end of asr.py
```
|
The above is the full content of "asr.py".
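
To sanity-check the output against a known transcript, a word error rate can be computed at the end of "asr.py" (a minimal sketch; it assumes the `evaluate` and `jiwer` packages are installed, and the reference string is an illustrative placeholder):

```
import evaluate

# Compare the transcription produced at the end of asr.py with a known
# ground-truth transcript; the reference below is a placeholder.
wer_metric = evaluate.load("wer")  # backed by the jiwer package
reference = "ground-truth Uyghur sentence goes here"
wer = wer_metric.compute(predictions=[transcription[0]], references=[reference])
print(f"WER: {wer:.2%}")
```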
|
|
|
## Hardware |
|
|
|
An NVIDIA GeForce RTX 3060 Ti was used for training on a Windows 10 system; training took about 14 hours.