---
language: en
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
widget:
- label: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- label: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
---

# Wav2Vec2-Base-960h

This repository is a reimplementation of [Facebook's official wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h).

The official model card does not describe how to convert the wav2vec 2.0 [pretrained model](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20) into a PyTorch weights file (`pytorch_model.bin`), so this repository rebuilds that file from the pretrained checkpoint. The conversion steps are:

```bash
# install dependencies
pip install transformers[sentencepiece]
pip install fairseq -U

# fetch the conversion script from the transformers repository
git clone https://github.com/huggingface/transformers.git
cp transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py .

# download the fairseq checkpoint to the path passed as --checkpoint_path below
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt -O ./wav2vec_small_960h.pt

# download the letter dictionary into ./dict
mkdir dict
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt -O ./dict/dict.ltr.txt

# convert the checkpoint; the result is written to ./outputs
# (depending on the transformers version, --dict_path may need to point
#  at the file ./dict/dict.ltr.txt rather than the directory)
mkdir outputs
python convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path ./outputs --checkpoint_path ./wav2vec_small_960h.pt --dict_path ./dict
```

# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch

# load model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# define function to read in sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# tokenize the first two samples (batch size 2, padded to the longest one)
input_values = tokenizer(ds["speech"][:2], return_tensors="pt", padding="longest").input_values

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)
```

## Evaluation

This code snippet shows how to evaluate **facebook/wav2vec2-base-960h** on LibriSpeech's "clean" and "other" test data. As written, it loads the "clean" test set; pass `"other"` as the configuration name to evaluate on the "other" test set.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import soundfile as sf
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

# read each sound file into an array
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

# run the model on a batch and store the decoded transcription
def map_to_pred(batch):
    input_values = tokenizer(batch["speech"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])

print("WER:", wer(result["text"], result["transcription"]))
```

*Result (WER)*:

| "clean" | "other" |
|---|---|
| 3.4 | 8.6 |

# Reference

[Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

[Facebook's huggingface Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h)

[Paper](https://arxiv.org/abs/2006.11477)
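
As a note on the metric used in the Evaluation section: word error rate is the word-level edit distance between reference and transcription, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, not the `jiwer` implementation; the names `word_edit_distance` and `simple_wer` are hypothetical and introduced here only for illustration. It returns a fraction, so the table values above, assuming they are percentages, would correspond to this result multiplied by 100.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def simple_wer(references, hypotheses):
    """Total word errors over all pairs divided by total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_errors += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return total_errors / total_words
```

For example, `simple_wer(["A MAN SAID TO THE UNIVERSE"], ["A MAN SAID TO THE UNIVERS"])` counts one substitution against six reference words.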