---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- speech
- swedish
- telephonic
- transformers
datasets:
- WMRNORDIC/swedish-telephonic-dataset
metrics:
- wer
base_model: openai/whisper-small
base_model_relation: finetune
license: apache-2.0
language:
- sv
- en
model-index:
- name: whisper-swedish-telephonic
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Swedish Telephonic Dataset
      type: custom
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 0.170
    - name: Base Model WER (Comparison)
      type: wer
      value: 0.888
---

# whisper-swedish-telephonic

## Model Overview

**`whisper-swedish-telephonic`** is a fine-tuned version of OpenAI's Whisper-Small model, specifically designed for transcribing Swedish telephonic audio. The model is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.

### Key Features:
- **Language:** Swedish (primary), with limited support for minor English segments.
- **Audio Types:** Telephonic conversations, customer support recordings, and general low-bandwidth audio.
- **Sample Rate:** 8kHz (resampled to 16kHz internally).
- **Special Tokens:** Supports conversational markers, disfluencies, and speaker-specific tags.
- **Performance:** Demonstrates significantly improved transcription accuracy over the base model for telephonic speech.

---

## Dataset

The model was fine-tuned using the **Swedish Telephonic Dataset**, consisting of:
- **Duration:** ~97 hours of annotated audio.
- **Domains:** Call center recordings, customer service conversations.
- **Annotations:**
  - Speaker IDs and timestamps.
  - Conversational tags: `(())`, `~`, ``.
  - Language switching: `...`.

### Preprocessing:
- **Audio:** Resampled to 16kHz.
- **Segmentations:** Aligned with timestamps.
- **Special Tokens:** Includes non-speech sounds like `[cough]`, `[laugh]`.

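
The resampling step listed under Preprocessing can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the card does not say which resampler was used, so polyphase filtering via `scipy.signal.resample_poly` is an assumption, and the helper name `to_whisper_rate` is hypothetical.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # Whisper's feature extractor expects 16 kHz input


def to_whisper_rate(audio, source_sr, target_sr=TARGET_SR):
    """Hypothetical helper: return mono float32 audio resampled to target_sr."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumes a (frames, channels) layout; average the channels to mono.
        audio = audio.mean(axis=1)
    if source_sr == target_sr:
        return audio
    # Reduce the rate ratio to lowest terms, e.g. 8000 -> 16000 becomes 2/1.
    divisor = gcd(source_sr, target_sr)
    resampled = resample_poly(audio, target_sr // divisor, source_sr // divisor)
    return resampled.astype(np.float32)


# One second of a synthetic 440 Hz tone at 8 kHz stands in for a telephone recording.
tone_8k = np.sin(2 * np.pi * 440 * np.arange(8_000) / 8_000)
tone_16k = to_whisper_rate(tone_8k, source_sr=8_000)
print(len(tone_8k), "->", len(tone_16k))
```

Doubling the sample rate does not restore the frequencies lost on the telephone channel; it only brings the signal to the rate the feature extractor accepts.
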

---

## Model Performance

### Word Error Rate (WER) Evaluation

The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.

| Metric  | Fine-Tuned Model | Base Whisper-Small |
|---------|------------------|--------------------|
| **WER** | 0.170            | 0.888              |

### Key Observations:
- **Fine-Tuned Model:**
  - Excellent transcription accuracy for colloquial Swedish, domain-specific terminology, and long utterances.
  - Handles speaker-specific annotations and conversational markers effectively.
- **Base Model:**
  - Struggles with Swedish syntax and domain-specific vocabulary.
  - Outputs nonsensical transcriptions for longer or complex sentences.

---

## Example Transcriptions

| Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
|---------|--------------|------------------|------------|------------------|------------|
| 1 | så nu | så nu | so, no | 0.000 | 1.000 |
| 2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
| 3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
| 5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
| 14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |

**Note:** Full segment-wise evaluation logs are available in the repository.

---

## Audio Example

This audio file demonstrates the model's transcription abilities:
- **File:** [trimmed_resampled_audio.wav](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic/blob/main/trimmed_resampled_audio.wav)
- **Content:** *Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?*
- **Audio Type:** Telephonic conversation.
- **Sample Rate:** 16kHz (resampled).
- **Purpose:** Showcasing the model's capabilities in transcribing Swedish telephonic speech.

---

## Intended Use

This model is designed for:
- **Customer Support Automation:** Transcription and analysis of call center recordings.
- **Telephony Analytics:** Sentiment analysis, compliance monitoring, and business intelligence.
- **Swedish Language Research:** Study of conversational patterns and colloquial expressions.

### Limitations:
- **Language Support:** Primarily Swedish; limited support for English.
- **Audio Quality:** Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
- **Preprocessing Requirement:** Audio at sample rates other than 8kHz must be resampled to 16kHz before transcription.

---

## Try the Model

You can test the model using the Hugging Face Playground or the dedicated endpoint:
- **Playground:** [Test the Model](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic)
- **Dedicated Endpoint:** [Endpoint URL](https://zckhajpu2q8h0sjw.us-east-1.aws.endpoints.huggingface.cloud)

---

## How to Use

```python
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load audio as mono and resample to 16 kHz, the rate Whisper's feature extractor expects
# (librosa is used here instead of soundfile because sf.read does not resample)
audio, sample_rate = librosa.load("path_to_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

# Transcribe
generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
```