metadata
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - whisper
  - speech
  - swedish
  - telephonic
  - transformers
datasets:
  - WMRNORDIC/swedish-telephonic-dataset
metrics:
  - wer
base_model: openai/whisper-small
base_model_relation: finetune
license: apache-2.0
language:
  - sv
  - en
model-index:
  - name: whisper-swedish-telephonic
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Swedish Telephonic Dataset
          type: custom
          split: test
        metrics:
          - name: Word Error Rate (WER)
            type: wer
            value: 0.17
          - name: Base Model WER (Comparison)
            type: wer
            value: 0.888

whisper-swedish-telephonic

Model Overview

whisper-swedish-telephonic is a fine-tuned version of OpenAI's Whisper-Small model, specifically designed for transcribing Swedish telephonic audio. The model is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.

Key Features:

  • Language: Swedish (primary), with limited support for minor English segments.
  • Audio Types: Telephonic conversations, customer support recordings, and general low-bandwidth audio.
  • Sample Rate: 8 kHz telephone audio, resampled to 16 kHz before feature extraction.
  • Special Tokens: Supports conversational markers, disfluencies, and speaker-specific tags.
  • Performance: WER of 0.170 on the Swedish telephonic test set, compared with 0.888 for the base Whisper-Small model.

Dataset

The model was fine-tuned using the Swedish Telephonic Dataset, consisting of:

  • Duration: ~97 hours of annotated audio.
  • Domains: Call center recordings, customer service conversations.
  • Annotations:
    • Speaker IDs and timestamps.
    • Conversational tags: (()), ~, <overlap>.
    • Language switching: <lang:English>...</lang:English>.

Preprocessing:

  • Audio: Resampled to 16kHz.
  • Segmentations: Aligned with timestamps.
  • Special Tokens: Includes non-speech sounds like [cough], [laugh].
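
The annotation conventions listed above can be handled with a few regular expressions when preparing transcripts for scoring or analysis. The following is a minimal sketch, not the dataset's actual preprocessing: the function names, the example sentence, and the decision to drop tags while keeping the code-switched words are all assumptions made for illustration, and only the literal tag forms listed in this card are covered.

```python
import re

# Assumed tag inventory, taken from the examples listed in this model card.
# How each tag should be treated is an assumption, not documented behavior.
LANG_SPAN = re.compile(r"<lang:(\w+)>(.*?)</lang:\1>", re.DOTALL)
ANGLE_TAG = re.compile(r"</?[A-Za-z][\w:]*>")  # e.g. <overlap>
NON_SPEECH = re.compile(r"\[[a-z_]+\]")        # e.g. [cough], [laugh]
MARKERS = re.compile(r"\(\(\)\)|~")            # the literal (()) and ~ markers


def extract_language_spans(text):
    """Return (language, content) pairs for every <lang:X>...</lang:X> span."""
    return [(m.group(1), m.group(2).strip()) for m in LANG_SPAN.finditer(text)]


def strip_annotations(text):
    """Drop annotation markup but keep the spoken words."""
    text = LANG_SPAN.sub(lambda m: m.group(2), text)  # unwrap language spans
    text = ANGLE_TAG.sub(" ", text)
    text = NON_SPEECH.sub(" ", text)
    text = MARKERS.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


# Hypothetical annotated line, invented for this example
example = "ja [cough] jag vill <lang:English>cancel my order</lang:English> ~ tack <overlap>"
print(extract_language_spans(example))
print(strip_annotations(example))
```

Whether tags should be removed or kept depends on the downstream task; for example, a diarization or code-switching study would likely keep the language spans rather than unwrap them.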

Model Performance

Word Error Rate (WER) Evaluation

The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.

| Metric | Fine-Tuned Model | Base Whisper-Small |
|--------|------------------|--------------------|
| WER    | 0.170            | 0.888              |

Key Observations:

  • Fine-Tuned Model:
    • Excellent transcription accuracy for colloquial Swedish, domain-specific terminology, and long utterances.
    • Handles speaker-specific annotations and conversational markers effectively.
  • Base Model:
    • Struggles with Swedish syntax and domain-specific vocabulary.
    • Outputs nonsensical transcriptions for longer or complex sentences.

Example Transcriptions

| Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
|---------|--------------|------------------|------------|------------------|------------|
| 1 | så nu | så nu | so, no | 0.000 | 1.000 |
| 2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
| 3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
| 5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
| 14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |

Note: Full segment-wise evaluation logs are available in the repository.
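
For readers who want to check per-segment scores themselves, here is a minimal sketch of the standard word-level WER: the edit distance between reference and hypothesis words, divided by the number of reference words. This card does not state which scoring tool or text normalization was used for the reported numbers, so the function below (`word_error_rate`, a name chosen for this example) is an assumption and may not match the official evaluation exactly.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1   # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Segment 2 from the examples: ground truth vs. base model output
print(word_error_rate("nu record du båda va", "nu rekordar du båda"))  # 0.4
```

In this pair, one substitution ("record" to "rekordar") and one deletion ("va") over five reference words give 2/5 = 0.4, which agrees with the value listed for segment 2.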


Audio Example

This audio file demonstrates the model's transcription abilities:

  • File: trimmed_resampled_audio.wav
  • Content: Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?
  • Audio Type: Telephonic conversation.
  • Sample Rate: 16kHz (resampled).
  • Purpose: Showcasing the model's capabilities in transcribing Swedish telephonic speech.

Intended Use

This model is designed for:

  • Customer Support Automation: Transcription and analysis of call center recordings.
  • Telephony Analytics: Sentiment analysis, compliance monitoring, and business intelligence.
  • Swedish Language Research: Study of conversational patterns and colloquial expressions.

Limitations:

  • Language Support: Primarily Swedish; limited support for English.
  • Audio Quality: Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
  • Preprocessing Requirement: Input audio, including 8 kHz telephone recordings, must be resampled to 16 kHz before transcription.
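
Before transcribing a batch of recordings, it can help to check each file's header to see which ones need conversion. The sketch below uses only Python's standard `wave` module, so it applies to PCM WAV files only. The helper names (`describe_wav`, `write_silence`) and the generated demo file are hypothetical; the 16 kHz target reflects the input rate Whisper's feature extractor expects.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # input rate expected by Whisper's feature extractor


def describe_wav(path):
    """Read a PCM WAV header and report whether the file needs conversion."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": duration,
        "needs_resampling": rate != TARGET_RATE,
        "needs_downmix": channels > 1,
    }


def write_silence(path, rate, seconds=1.0):
    """Write a mono 16-bit silent WAV file (demo input only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo: a one-second 8 kHz mono file, as a telephone recording might be stored
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "call_8k.wav")
    write_silence(demo_path, rate=8_000)
    info = describe_wav(demo_path)

print(info)
```

Files flagged with `needs_resampling` or `needs_downmix` would then be converted with an audio library of your choice before being passed to the processor.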

Try the Model

You can test the model using the Hugging Face Playground or the dedicated endpoint.


How to Use

```python
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load audio as mono and resample to 16 kHz; the Whisper feature extractor
# rejects any other sampling rate, so raw 8 kHz telephone audio must be converted
audio, sample_rate = librosa.load("path_to_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

# Transcribe
generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)
```