metadata
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- speech
- swedish
- telephonic
- transformers
datasets:
- WMRNORDIC/swedish-telephonic-dataset
metrics:
- wer
base_model: openai/whisper-small
base_model_relation: finetune
license: apache-2.0
language:
- sv
- en
model-index:
- name: whisper-swedish-telephonic
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Swedish Telephonic Dataset
      type: custom
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 0.17
    - name: Base Model WER (Comparison)
      type: wer
      value: 0.888
whisper-swedish-telephonic
Model Overview
whisper-swedish-telephonic is a fine-tuned version of OpenAI's Whisper-Small model for transcribing Swedish telephonic audio. It is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.
Key Features:
- Language: Swedish (primary), with limited support for minor English segments.
- Audio Types: Telephonic conversations, customer support recordings, and general low-bandwidth audio.
- Sample Rate: 8kHz source audio, resampled to 16kHz (the rate Whisper's feature extractor expects).
- Special Tokens: Supports conversational markers, disfluencies, and speaker-specific tags.
- Performance: WER of 0.170 on the telephonic test set, compared with 0.888 for the base Whisper-Small model.
Dataset
The model was fine-tuned using the Swedish Telephonic Dataset, consisting of:
- Duration: ~97 hours of annotated audio.
- Domains: Call center recordings, customer service conversations.
- Annotations:
  - Speaker IDs and timestamps.
  - Conversational tags: `(())`, `~`, `<overlap>`.
  - Language switching: `<lang:English>...</lang:English>`.
Preprocessing:
- Audio: Resampled to 16kHz.
- Segmentations: Aligned with timestamps.
- Special Tokens: Includes non-speech sounds like `[cough]`, `[laugh]`.
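For downstream text analysis it can be useful to separate the annotation markup listed above from the spoken words. The following is a minimal sketch, not part of the model or its repository: the helper names and the sample string are hypothetical, and the regular expressions only cover the markers named in this card (`(())`, `~`, `<overlap>`, `[cough]`, `[laugh]`, and `<lang:...>` spans), whose exact semantics are not specified here.

```python
import re

# Matches a language-switch span such as <lang:English>okay</lang:English>.
LANG_SPAN = re.compile(r"<lang:(\w+)>(.*?)</lang:\1>", re.DOTALL)

# Matches the standalone markers named in this card (assumed to appear literally).
MARKERS = re.compile(r"\(\(\)\)|~|<overlap>|\[(?:cough|laugh)\]")


def extract_language_spans(text):
    """Return (language, text) pairs for every <lang:X>...</lang:X> span."""
    return [(m.group(1), m.group(2).strip()) for m in LANG_SPAN.finditer(text)]


def strip_annotations(text):
    """Remove markers, keep the words inside language spans, and tidy whitespace."""
    text = LANG_SPAN.sub(lambda m: m.group(2), text)
    text = MARKERS.sub(" ", text)
    return " ".join(text.split())


# Hypothetical transcript line, made up for illustration only.
sample = "ja [laugh] det är <lang:English>okay</lang:English> <overlap> tack ~"
print(extract_language_spans(sample))
print(strip_annotations(sample))
```

Keeping the language spans in a separate list lets English segments be counted or filtered without losing them from the cleaned text.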
Model Performance
Word Error Rate (WER) Evaluation
The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.
Metric | Fine-Tuned Model | Base Whisper-Small |
---|---|---|
WER | 0.170 | 0.888 |
Key Observations:
- Fine-Tuned Model:
  - Transcribes colloquial Swedish, domain-specific terminology, and long utterances accurately.
  - Handles speaker-specific annotations and conversational markers effectively.
- Base Model:
  - Struggles with Swedish syntax and domain-specific vocabulary.
  - Frequently outputs English or unrelated text in place of the Swedish speech, especially for longer or more complex sentences.
Example Transcriptions
Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
---|---|---|---|---|---|
1 | så nu | så nu | so, no | 0.000 | 1.000 |
2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |
Note: Full segment-wise evaluation logs are available in the repository.
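The per-segment WER values in the table can be checked with a word-level edit distance divided by the number of reference words. The sketch below is a plain-Python illustration of that standard definition, not the evaluation script used for this model. It splits on whitespace only, and the text normalization applied in the original evaluation is not stated, so rows containing punctuation or elided text (`...`) may not reproduce exactly.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Rows 1, 2 and 14 of the table above (ground truth vs. base model output).
print(word_error_rate("så nu", "so, no"))
print(word_error_rate("nu record du båda va", "nu rekordar du båda"))
print(word_error_rate("till frankrike", "thank you"))
```

For row 2, one substitution (`record` to `rekordar`) and one deletion (`va`) over five reference words gives 2/5 = 0.4, matching the table.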
Audio Example
This audio file demonstrates the model's transcription abilities:
- File: trimmed_resampled_audio.wav
- Content: Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?
- Audio Type: Telephonic conversation.
- Sample Rate: 16kHz (resampled).
- Purpose: Showcasing the model's capabilities in transcribing Swedish telephonic speech.
Intended Use
This model is designed for:
- Customer Support Automation: Transcription and analysis of call center recordings.
- Telephony Analytics: Sentiment analysis, compliance monitoring, and business intelligence.
- Swedish Language Research: Study of conversational patterns and colloquial expressions.
Limitations:
- Language Support: Primarily Swedish; limited support for English.
- Audio Quality: Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
- Preprocessing Requirement: Input must be 16kHz; resample audio recorded at any other rate, including 8kHz telephone audio, before passing it to the processor.
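To illustrate what the resampling requirement means for 8kHz telephone audio, here is a dependency-free sketch that upsamples a list of samples by linear interpolation. The function name is hypothetical and this is only an illustration of the idea: linear interpolation does no band-limiting, so a real pipeline would more likely rely on a dedicated resampler such as `librosa.resample` or `scipy.signal.resample_poly`.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a mono signal (sequence of floats) by linear interpolation."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sample rates must be positive")
    if orig_sr == target_sr or len(samples) < 2:
        return list(samples)

    n_out = int(round(len(samples) * target_sr / orig_sr))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr   # position in the original signal
        lo = min(int(pos), last)        # sample at or before that position
        hi = min(lo + 1, last)          # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# Four samples at 8 kHz become eight samples at 16 kHz.
upsampled = resample_linear([0.0, 1.0, 0.0, -1.0], 8000, 16000)
print(len(upsampled), upsampled)
```

Going from 8kHz to 16kHz doubles the number of samples but adds no information above 4kHz, which is consistent with the model being tuned for narrowband telephone speech.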
Try the Model
You can test the model using the Hugging Face Playground or the dedicated endpoint:
- Playground: Test the Model
- Dedicated Endpoint: Endpoint URL
How to Use
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor
# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)
# Load audio as mono and resample to 16 kHz; the Whisper feature extractor
# rejects input whose sampling_rate is not 16000 (e.g. raw 8 kHz telephone audio)
audio, sample_rate = librosa.load("path_to_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
# Transcribe
generated_ids = model.generate(inputs.input_features)
# Decode token IDs to text, dropping special tokens such as language and task markers
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)