metadata
language: ko
datasets:
  - kresnik/zeroth_korean
tags:
  - speech
  - audio
  - automatic-speech-recognition
license: apache-2.0
model-index:
  - name: Wav2Vec2 XLSR Korean
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Zeroth Korean
          type: kresnik/zeroth_korean
          args: clean
        metrics:
          - name: Test WER
            type: wer
            value: 4.74
          - name: Test CER
            type: cer
            value: 1.78

Evaluation on Zeroth-Korean ASR corpus

Google Colab notebook (Korean)

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset
import soundfile as sf
import torch
from jiwer import wer

processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")

model = Wav2Vec2ForCTC.from_pretrained("kresnik/wav2vec2-large-xlsr-korean").to('cuda')

ds = load_dataset("kresnik/zeroth_korean", "clean")

test_ds = ds['test']

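# Load the raw waveform for each example from its audio file path.
# The sample rate returned by sf.read is discarded here.
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch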

test_ds = test_ds.map(map_to_array)

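def map_to_pred(batch):
    # Pad every utterance in the batch to the longest one, then move it to the GPU.
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")

    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        logits = model(input_values).logits

    # Greedy CTC decoding: pick the most likely token per frame,
    # then let the processor turn the token ids into text.
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch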

result = test_ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=["speech"])

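# jiwer returns WER as a fraction; scale it to a percentage to compare with the figure below.
print("WER: {:.2f}%".format(100 * wer(result["text"], result["transcription"])))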

Expected WER: 4.74%

Expected CER: 1.78%
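
The script above prints only the WER, so here is a rough sketch of how the CER could be computed from the same `result` columns. It assumes CER is the character-level edit distance summed over all utterances and divided by the total number of reference characters, with spaces counted as characters and each precomposed Hangul syllable counted as one character. The card does not say how its CER was computed, so the numbers may not match exactly. `edit_distance` and `char_error_rate` are hypothetical helper names, not part of the original script.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the hypothesis
                curr[j - 1] + 1,     # extra character in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def char_error_rate(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total character edits divided by total reference characters."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

With the evaluation script already run, something like `print("CER: {:.2f}%".format(100 * char_error_rate(result["text"], result["transcription"])))` would report the value as a percentage.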