	---
	language: en
	datasets:
	- librispeech_asr
	tags:
	- audio
	- automatic-speech-recognition
	license: apache-2.0
	widget:
	- example_title: Librispeech sample 1
	  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
	- example_title: Librispeech sample 2
	  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
	---

	# Wav2Vec2-Base-960h

	This repository is a reimplementation of [Facebook's official wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h).
	There is no published description of how to convert the fairseq wav2vec 2.0 [pretrained model](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20) into a `pytorch_model.bin` file,
	so we rebuild `pytorch_model.bin` from the pretrained checkpoint.
	The conversion steps are as follows.

	```bash
	# quote the extra so shells such as zsh do not expand the brackets
	pip install "transformers[sentencepiece]"
	pip install -U fairseq

	git clone https://github.com/huggingface/transformers.git
	cp transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py .

	wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt -O ./wav2vec_small_960h.pt
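	# download the letter dictionary into ./dict so that --dict_path can find it
	mkdir -p dict
	wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt -O ./dict/dict.ltr.txt

	# depending on the version of the conversion script, --dict_path may need to
	# point at the file ./dict/dict.ltr.txt rather than at the directory ./dict
	mkdir -p outputs
	python convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path ./outputs --checkpoint_path ./wav2vec_small_960h.pt --dict_path ./dict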
	```
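For orientation, `dict.ltr.txt` is a plain-text fairseq dictionary with one `symbol count` pair per line. The sketch below shows how such a file could be turned into a token-to-id mapping. It is an illustration, not the conversion script's code: `build_vocab` is a hypothetical helper, the sample lines use made-up counts, and the layout of the four special tokens (ids 0 to 3, with `<pad>` at 0 so it can serve as the CTC blank) is an assumption.

```python
def build_vocab(dict_lines):
    """Map each symbol of a fairseq-style dictionary to an integer id.

    Hypothetical helper for illustration; not part of transformers or fairseq.
    """
    # Assumed special-token layout: <pad> sits at id 0 and acts as the CTC blank.
    vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3}
    for line in dict_lines:
        line = line.strip()
        if not line:
            continue
        # Each line is "<symbol> <count>"; only the symbol is needed here.
        symbol, _count = line.rsplit(" ", 1)
        vocab[symbol] = len(vocab)
    return vocab


# Made-up sample in the same format as dict.ltr.txt ("|" is the word delimiter).
sample_lines = ["| 100", "E 90", "T 80"]
vocab = build_vocab(sample_lines)
print(vocab)
```

To apply it to the downloaded file, pass an open file handle instead of the sample list, for example `build_vocab(open("./dict/dict.ltr.txt", encoding="utf-8"))`.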

	# Usage

	To transcribe audio files, the model can be used as a standalone acoustic model as follows:

	```python
	from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
	from datasets import load_dataset
	import soundfile as sf
	import torch

	# load model and tokenizer
	# (newer transformers releases deprecate Wav2Vec2Tokenizer in favor of Wav2Vec2Processor)
	tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
	model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

	# define function to read in sound file
	def map_to_array(batch):
	    speech, _ = sf.read(batch["file"])
	    batch["speech"] = speech
	    return batch

	# load dummy dataset and read soundfiles
	ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
	ds = ds.map(map_to_array)

	# tokenize
	input_values = tokenizer(ds["speech"][:2], return_tensors="pt", padding="longest").input_values  # batch of the first 2 utterances

	# retrieve logits
	logits = model(input_values).logits

	# take argmax and decode
	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = tokenizer.batch_decode(predicted_ids)
	```
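The last step above, argmax followed by `batch_decode`, is greedy CTC decoding: consecutive repeated ids are collapsed, blank tokens are dropped, and the word delimiter becomes a space. A minimal pure-Python sketch of that idea follows. The toy vocabulary, `ctc_greedy_decode`, and the choice of `<pad>` as blank and `|` as word delimiter are assumptions for illustration; the real mapping comes from the model's `vocab.json`.

```python
from itertools import groupby

# Toy vocabulary for illustration only (hypothetical ids).
VOCAB = {0: "<pad>", 1: "|", 2: "C", 3: "A", 4: "T", 5: "S"}
BLANK_ID = 0          # assumed: <pad> doubles as the CTC blank
WORD_DELIMITER = "|"  # assumed: "|" marks word boundaries


def ctc_greedy_decode(ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, drop blanks, and join the remaining characters."""
    # groupby yields one key per run of identical consecutive ids.
    collapsed = [token_id for token_id, _run in groupby(ids)]
    chars = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids: C C <pad> A T <pad> | | S A A T
frame_ids = [2, 2, 0, 3, 4, 0, 1, 1, 5, 3, 3, 4]
decoded = ctc_greedy_decode(frame_ids)
print(decoded)
```

Because repeats are collapsed before blanks are removed, a blank between two identical ids keeps both characters: `[3, 0, 3]` decodes to `AA`, while `[3, 3]` decodes to `A`.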

	## Evaluation

	This code snippet shows how to evaluate `facebook/wav2vec2-base-960h` on LibriSpeech's "clean" test data. To evaluate on the "other" test data, load the "other" configuration instead.

	```python
	from datasets import load_dataset
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
	import soundfile as sf
	import torch
	from jiwer import wer


	librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

	model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
	tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

	def map_to_array(batch):
	    speech, _ = sf.read(batch["file"])
	    batch["speech"] = speech
	    return batch

	librispeech_eval = librispeech_eval.map(map_to_array)

	def map_to_pred(batch):
	    input_values = tokenizer(batch["speech"], return_tensors="pt", padding="longest").input_values
	    with torch.no_grad():
	        logits = model(input_values.to("cuda")).logits

	    predicted_ids = torch.argmax(logits, dim=-1)
	    transcription = tokenizer.batch_decode(predicted_ids)
	    batch["transcription"] = transcription
	    return batch

	result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])

	print("WER:", wer(result["text"], result["transcription"]))
	```

	Result (WER, %):

	| "clean" | "other" |
	|---|---|
	| 3.4 | 8.6 |
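Word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The table values correspond to that fraction multiplied by 100. The sketch below is a plain-Python illustration of the metric, pooled over all utterances. `word_error_rate` is a hypothetical helper written for this example and is not the `jiwer` implementation.

```python
def word_error_rate(references, hypotheses):
    """Total word-level edit distance divided by total reference word count."""
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # prev[j]: edit distance between the reference prefix seen so far
        # and the first j hypothesis words.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                curr.append(min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                ))
            prev = curr
        errors += prev[-1]
        total_words += len(ref_words)
    return errors / total_words


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
score = word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
print(f"WER: {100 * score:.1f}%")
```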


	# Reference


	[Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

	[Facebook's Wav2Vec2 model on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h)

	[Paper](https://arxiv.org/abs/2006.11477)