metadata

library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - whisper
  - speech
  - swedish
  - telephonic
  - transformers
datasets:
  - WMRNORDIC/swedish-telephonic-dataset
metrics:
  - wer
base_model: openai/whisper-small
base_model_relation: finetune
license: apache-2.0
language:
  - sv
  - en
model-index:
  - name: whisper-swedish-telephonic
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Swedish Telephonic Dataset
          type: custom
          split: test
        metrics:
          - name: Word Error Rate (WER)
            type: wer
            value: 0.17
          - name: Base Model WER (Comparison)
            type: wer
            value: 0.888

whisper-swedish-telephonic

Model Overview

whisper-swedish-telephonic is a fine-tuned version of OpenAI's Whisper-Small model, specifically designed for transcribing Swedish telephonic audio. The model is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.

Key Features:

Language: Swedish (primary), with limited support for minor English segments.
Audio Types: Telephonic conversations, customer support recordings, and general low-bandwidth audio.
Sample Rate: 8 kHz source audio, resampled to 16 kHz before it is passed to the model.
Special Tokens: Supports conversational markers, disfluencies, and speaker-specific tags.
Performance: Reaches a WER of 0.170 on the Swedish telephonic test set, compared with 0.888 for the base Whisper-Small model.

Dataset

The model was fine-tuned using the Swedish Telephonic Dataset, consisting of:

Duration: ~97 hours of annotated audio.
Domains: Call center recordings, customer service conversations.
Annotations:
- Speaker IDs and timestamps.
- Conversational tags: (()), ~, <overlap>.
- Language switching: <lang:English>...</lang:English>.

Preprocessing:

Audio: Resampled to 16kHz.
Segmentation: Segments aligned with the annotated timestamps.
Special Tokens: Includes non-speech sounds such as [cough] and [laugh].
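
For downstream analytics it can be useful to remove the annotation markers listed above from a transcript. The following is a minimal sketch using only Python's re module. The card does not define the exact semantics of each marker, so the handling here is an assumption: (( )) is treated as wrapping unclear speech, the text inside a language tag is kept, and the example sentence is made up for illustration. The function name strip_annotations is hypothetical.

```python
import re

# Patterns for the markers named in this card. How the dataset actually uses
# each marker is an assumption made for this sketch.
LANG_TAG = re.compile(r"</?lang:[^>]+>")      # <lang:English> and </lang:English>
OVERLAP_TAG = re.compile(r"</?overlap>")      # <overlap>, and </overlap> if present
NON_SPEECH = re.compile(r"\[[A-Za-z_]+\]")    # [cough], [laugh], ...
DOUBLE_PARENS = re.compile(r"\(\((.*?)\)\)")  # (()) or ((unclear words))


def strip_annotations(text: str) -> str:
    """Remove annotation markers and return plain, whitespace-normalized text."""
    text = LANG_TAG.sub(" ", text)              # drop the tags, keep the enclosed words
    text = OVERLAP_TAG.sub(" ", text)
    text = NON_SPEECH.sub(" ", text)
    text = DOUBLE_PARENS.sub(r" \1 ", text)     # keep any words inside the parentheses
    text = text.replace("~", " ")
    return " ".join(text.split())


# Made-up example line that uses every marker type once
raw = "ja [laugh] det är <lang:English>okay</lang:English> (()) ~ då <overlap>"
clean = strip_annotations(raw)
print(clean)  # ja det är okay då
```

If the markers carry information you need, such as where the English segments are, extract them before stripping instead of discarding them.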

Model Performance

Word Error Rate (WER) Evaluation

The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.

Metric	Fine-Tuned Model	Base Whisper-Small
WER	0.170	0.888

Key Observations:

Fine-Tuned Model:
- Excellent transcription accuracy for colloquial Swedish, domain-specific terminology, and long utterances.
- Handles speaker-specific annotations and conversational markers effectively.
Base Model:
- Struggles with Swedish syntax and domain-specific vocabulary.
- Outputs nonsensical transcriptions for longer or complex sentences.

Example Transcriptions

Segment	Ground Truth	Fine-Tuned Model	Base Model	WER (Base)
1	så nu	så nu	so, no	1.000
2	nu record du båda va	nu record du båda va	nu rekordar du båda	0.400
3	ja jag kommer inte ihåg	ja jag kommer inte ihåg	i am coming to you	1.000
5	sen när då, sen alltid... inga gäster	sen när då, sen alltid... inga gäster	sen då, sen alltid... ingen gest	0.250
14	till frankrike	till frankrike	thank you	1.000

Note: Full segment-wise evaluation logs are available in the repository.
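
As an illustration of how a per-segment WER is obtained, here is a minimal sketch that computes the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It splits on whitespace only. The card does not state which text normalization or tooling was used for the reported figures, so this is not necessarily the evaluation script behind them, and not every row need reproduce exactly. The function name word_error_rate is chosen for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Segment 2 from the table: one substitution and one deletion over five words
print(word_error_rate("nu record du båda va", "nu rekordar du båda"))  # 0.4
# Segment 14: both words are wrong
print(word_error_rate("till frankrike", "thank you"))  # 1.0
```

A corpus-level WER is usually computed by summing the edit distances over all segments and dividing by the total number of reference words, which is not the same as averaging the per-segment values.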

Audio Example

This audio file demonstrates the model's transcription abilities:

File: trimmed_resampled_audio.wav
Content: Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?
Audio Type: Telephonic conversation.
Sample Rate: 16kHz (resampled).
Purpose: Showcasing the model's capabilities in transcribing Swedish telephonic speech.

Intended Use

This model is designed for:

Customer Support Automation: Transcription and analysis of call center recordings.
Telephony Analytics: Sentiment analysis, compliance monitoring, and business intelligence.
Swedish Language Research: Study of conversational patterns and colloquial expressions.

Limitations:

Language Support: Primarily Swedish; limited support for English.
Audio Quality: Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
Preprocessing Requirement: Input audio, including 8 kHz telephone recordings, must be resampled to 16 kHz before transcription.
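
Whisper's feature extractor expects 16 kHz mono input, so 8 kHz telephone audio has to be upsampled first. The sketch below shows one way to do this with scipy.signal.resample_poly. It is not necessarily the resampling method used when training this model; the helper name to_whisper_rate and the synthetic test tone are assumptions made for the example.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate expected by Whisper's feature extractor


def to_whisper_rate(audio: np.ndarray, source_rate: int) -> np.ndarray:
    """Return mono float32 audio resampled to 16 kHz."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # soundfile-style layout (frames, channels): average channels to mono
        audio = audio.mean(axis=1)
    if source_rate == TARGET_RATE:
        return audio
    # Reduce the rate ratio to the smallest integer up/down factors
    divisor = gcd(TARGET_RATE, source_rate)
    up, down = TARGET_RATE // divisor, source_rate // divisor
    return resample_poly(audio, up, down).astype(np.float32)


# One second of a 440 Hz tone at 8 kHz, standing in for a call recording
t = np.arange(8_000) / 8_000
telephone_audio = np.sin(2 * np.pi * 440 * t)
resampled = to_whisper_rate(telephone_audio, source_rate=8_000)
print(len(telephone_audio), len(resampled))  # 8000 16000
```

Upsampling does not restore the frequencies above 4 kHz that the telephone channel removed; it only brings the signal to the rate the model expects.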

Try the Model

You can test the model using the Hugging Face Playground or the dedicated endpoint:

Playground: Test the Model
Dedicated Endpoint: Endpoint URL

How to Use

import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load the audio as mono and resample it to 16 kHz. The Whisper feature
# extractor rejects any other sampling rate, so an 8 kHz file read as-is
# would raise an error. librosa is used here instead of soundfile because
# it can downmix and resample while loading.
audio, sample_rate = librosa.load("path_to_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

# Transcribe
generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)
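
Whisper works on fixed 30-second windows, and with default settings the processor pads or truncates its input to that length, so the snippet above only covers the first 30 seconds of a longer call. The following is a minimal sketch of splitting a 16 kHz recording into windows that can each be transcribed in turn. The helper name iter_windows and its parameters are assumptions made for this example, and fixed-size splitting can cut a word at a window boundary.

```python
from typing import Iterator, Sequence, Tuple

SAMPLE_RATE = 16_000   # Whisper input rate in Hz
WINDOW_SECONDS = 30.0  # length of Whisper's fixed input window


def iter_windows(
    audio: Sequence[float],
    window_seconds: float = WINDOW_SECONDS,
    overlap_seconds: float = 0.0,
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (start_sample, chunk) pairs that together cover the whole audio."""
    window = int(window_seconds * sample_rate)
    step = window - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    for start in range(0, len(audio), step):
        yield start, audio[start:start + window]
        if start + window >= len(audio):
            break  # the last window already reached the end of the audio


# 70 seconds of silence as a stand-in for a real recording
recording = [0.0] * (70 * SAMPLE_RATE)
for start, chunk in iter_windows(recording):
    print(f"window at {start / SAMPLE_RATE:.0f}s, {len(chunk) / SAMPLE_RATE:.0f}s long")
```

Each chunk can be passed to the processor and model.generate exactly as in the snippet above, and the decoded texts joined in order. As an alternative, the transformers automatic-speech-recognition pipeline accepts a chunk_length_s argument that handles long audio for you.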