---
language: ko
tags:
- audio
- speech-recognition
- pronunciation-assessment
license: apache-2.0
datasets:
- AI_Hub
metrics:
- 1~5
widget:
- text: 안녕하세요. 오늘 날씨가 좋습니다.
  example_title: Sample Korean Sentence
- text: 영어는 세계 공용어입니다.
  example_title: Another Sample Sentence
pipeline_tag: audio-classification
---

# Whisper Fine-tuned Pronunciation Scorer

This model assesses pronunciation quality for Korean speech. It is based on the openai/whisper-small model and fine-tuned on the Korea AI-Hub (https://www.aihub.or.kr/) dataset for evaluating the Korean pronunciation of non-native speakers.

# Model Description
The Pronunciation Scorer takes an audio clip and its corresponding text transcript and predicts a Korean pronunciation score on a scale of 1 to 5. It runs both inputs through Whisper's encoder-decoder, averages the decoder's final hidden states over the token dimension, and passes the pooled vector through an additional linear layer to produce the score.
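
The pooling-plus-linear-head step can be sketched in isolation. This is a minimal illustration rather than the trained model: it assumes mean pooling over the token dimension (as the usage example in this card does), uses random tensors in place of real decoder outputs, and uses an untrained `score_head`. The batch size and token count are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

# Stand-in for the decoder's last hidden state: (batch, tokens, d_model).
# d_model = 768 matches whisper-small; the values are random, not real features.
batch_size, num_tokens, d_model = 2, 7, 768
last_hidden_state = torch.randn(batch_size, num_tokens, d_model)

# Linear head mapping one pooled vector to one scalar score (untrained here).
score_head = nn.Linear(d_model, 1)

pooled = last_hidden_state.mean(dim=1)    # average over tokens -> (batch, d_model)
scores = score_head(pooled).squeeze(-1)   # one scalar per clip  -> (batch,)

print(pooled.shape, scores.shape)
```

Because the head sees only the pooled vector, every utterance yields exactly one score regardless of transcript length.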

# How to Use
To use this model, follow these steps:

1. Install the required libraries (`torch`, `torchaudio`, `transformers`)
2. Load the model and processor
3. Prepare your audio file and text transcript
4. Predict the pronunciation score

Here's a detailed example of how to use the model:

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch.nn as nn

class WhisperPronunciationScorer(nn.Module):
    def __init__(self, pretrained_model):
        super().__init__()
        self.whisper = pretrained_model
        self.score_head = nn.Linear(self.whisper.config.d_model, 1)

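    def forward(self, input_features, labels=None):
        # Passing the transcript as labels lets Whisper build the decoder inputs from it
        outputs = self.whisper(input_features, labels=labels, output_hidden_states=True)
        last_hidden_state = outputs.decoder_hidden_states[-1]
        # Mean-pool over tokens, then map to one score per sample: shape (batch,)
        scores = self.score_head(last_hidden_state.mean(dim=1)).squeeze(-1)
        return scores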

def load_model(model_path, device):
    model_name = "openai/whisper-small"
    processor = WhisperProcessor.from_pretrained(model_name)
    pretrained_model = WhisperForConditionalGeneration.from_pretrained(model_name)
    model = WhisperPronunciationScorer(pretrained_model).to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    return model, processor

def predict_pronunciation_score(model, processor, audio_path, transcript, device):
    # Load audio, mix down to mono, and resample to the 16 kHz Whisper expects
    audio, sr = torchaudio.load(audio_path)
    audio = audio.mean(dim=0, keepdim=True)  # (channels, time) -> (1, time)
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    input_features = processor(audio.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt").input_features.to(device)
    
    # Prepare transcript
    labels = processor(text=transcript, return_tensors="pt").input_ids.to(device)
    
    # Predict score
    with torch.no_grad():
        score = model(input_features, labels)
    return score.item()

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "path/to/your/model.pth"
model, processor = load_model(model_path, device)

# Run prediction
audio_path = "path/to/your/audio.wav"
transcript = "안녕하세요"
score = predict_pronunciation_score(model, processor, audio_path, transcript, device)
print(f"Predicted pronunciation score: {score:.2f}")
```
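
The score head in the example above is a plain linear layer with no output activation, so a raw prediction can in principle fall slightly outside the 1 to 5 scale. A minimal post-processing sketch, assuming simple clipping is acceptable for your use case; `clamp_score` is a hypothetical helper and not part of the released model:

```python
def clamp_score(raw_score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Clip a raw regression output into the documented 1-5 range."""
    return max(low, min(high, raw_score))


# Illustrative raw values, not real model outputs
for raw in (0.4, 3.27, 5.6):
    print(f"{raw:.2f} -> {clamp_score(raw):.2f}")
```

Values already inside the range pass through unchanged, so the helper only affects out-of-range predictions.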