---
title: Post-ASR LLM N-Best Transcription Correction
emoji: 🏢
colorFrom: yellow
colorTo: yellow
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
license: mit
short_description: Generative Error Correction (GER) Task Baseline, WER
---

# Post-ASR Text Correction WER Leaderboard

This application displays a baseline Word Error Rate (WER) leaderboard for the test data in the [GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction](https://huggingface.co/datasets/GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction) dataset.

## Dataset Sources

The leaderboard shows WER metrics for multiple speech recognition sources as columns:

- CHiME4
- CORAAL
- CommonVoice
- LRS2
- LibriSpeech (Clean and Other)
- SwitchBoard
- Tedlium-3
- OVERALL (aggregate across all sources)

## Baseline Methods

The leaderboard displays three baseline approaches:

1. **No LM Baseline**: uses the 1-best ASR output (`input1`) without any correction
2. **N-gram Ranking**: ranks the N-best hypotheses with simple n-gram statistics and selects the top-scoring one
3. **Subwords Voting Correction**: corrects the transcript by combining information from all N-best hypotheses through a voting mechanism

## Metrics

The leaderboard displays the following rows:

- **Number of Examples**: count of test-set examples for each source
- **Word Error Rate (No LM)**: WER between the reference and the 1-best ASR output
- **Word Error Rate (N-gram Ranking)**: WER between the reference and the n-gram-ranked best hypothesis
- **Word Error Rate (Subwords Voting Correction)**: WER between the reference and the voting-corrected transcript

Lower WER values indicate better transcription accuracy.

## Table Structure

The leaderboard is displayed as a table with:

- **Rows**: different metrics (example counts and WER values for each method)
- **Columns**: different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL

Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.

## Technical Details

### N-gram Ranking

This method scores each hypothesis in the N-best list using:

- N-gram statistics (4-grams)
- Text length
- N-gram variety

The hypothesis with the highest score is selected. A hedged sketch of this ranker appears in the Baseline Sketches section below.

### Subwords Voting Correction

This method uses a simple voting mechanism:

- Groups hypotheses of the same length
- For each word position, chooses the most common word across all hypotheses
- Constructs a new transcript from these voted words

A sketch of this voting step also appears in the Baseline Sketches section below.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
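## Baseline Sketches

The snippets below illustrate the three baselines described above. They are minimal sketches, not the app's actual code; any field names, weights, and tie-breaking rules they introduce are assumptions.

The no-LM baseline simply scores the 1-best ASR output against the reference transcript. With the [jiwer](https://github.com/jitsi/jiwer) package, corpus-level WER over a set of reference/hypothesis pairs looks like this (the toy strings stand in for the dataset's reference transcripts and `input1` fields):

```python
import jiwer

# Toy pairs standing in for the dataset's reference transcripts and
# `input1` (1-best ASR output) fields.
references = [
    "the quick brown fox jumps over the lazy dog",
    "speech recognition is not perfect",
]
hypotheses = [
    "the quick brown fox jumped over the lazy dog",
    "speech recognition is not perfect yet",
]

# jiwer computes corpus-level WER: total substitutions, deletions, and
# insertions divided by the total number of reference words, so longer
# utterances carry more weight in the aggregate score.
print(f"No-LM baseline WER: {jiwer.wer(references, hypotheses):.4f}")
```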
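The Technical Details section lists the ingredients of the n-gram ranker (4-gram statistics, text length, n-gram variety) but not their weighting, so the scorer below is only one plausible reading: 4-grams are pooled across the N-best list, and each hypothesis is rewarded for agreeing with that pool, for its 4-gram variety, and for its length, with uniform illustrative weights.

```python
from collections import Counter

def ngrams(words, n=4):
    """All n-grams of a token list (empty if the list is shorter than n)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rank_nbest(hypotheses, n=4):
    """Return the hypothesis whose n-grams agree most with the N-best list.

    A sketch only: the actual weighting of n-gram statistics, text
    length, and n-gram variety used for the leaderboard is an assumption.
    """
    tokenized = [h.split() for h in hypotheses]
    # Pool 4-gram counts across the whole N-best list, so that n-grams
    # many hypotheses agree on carry more weight.
    pool = Counter(g for words in tokenized for g in ngrams(words, n))

    def score(words):
        grams = ngrams(words, n)
        agreement = sum(pool[g] for g in grams)  # n-gram statistics
        variety = len(set(grams))                # n-gram variety
        length = len(words)                      # text length
        return agreement + variety + length      # uniform illustrative weights

    return max(hypotheses, key=lambda h: score(h.split()))

nbest = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on the mat",
]
print(rank_nbest(nbest))  # -> "the cat sat on the mat"
```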
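The voting correction can likewise be sketched in a few lines, assuming the vote runs over the largest group of equal-length hypotheses (a plausible reading of "groups hypotheses of the same length", not confirmed by the source):

```python
from collections import Counter

def voting_correction(hypotheses):
    """Position-wise majority vote over equal-length hypotheses.

    A sketch of the voting idea described above: group hypotheses by
    word count, vote within the largest group, and rebuild the
    transcript from the winning word at each position. How ties and
    length mismatches are really handled is an assumption.
    """
    tokenized = [h.split() for h in hypotheses]
    # Group hypotheses of the same length; vote within the biggest group.
    groups = {}
    for words in tokenized:
        groups.setdefault(len(words), []).append(words)
    group = max(groups.values(), key=len)

    # For each position, keep the most common word across the group.
    voted = [
        Counter(words[i] for words in group).most_common(1)[0][0]
        for i in range(len(group[0]))
    ]
    return " ".join(voted)

nbest = [
    "the cat sat on the mat",
    "the cat sad on the mat",
    "the cat sat on the mad",
    "a cat sat on mat",  # different length: excluded from the vote
]
print(voting_correction(nbest))  # -> "the cat sat on the mat"
```

For simplicity this sketch votes over whole words; the method's name suggests the real implementation votes over subword units, which would follow the same pattern.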