---
title: Post-ASR LLM N-Best Transcription Correction
emoji: 🏢
colorFrom: yellow
colorTo: yellow
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
license: mit
short_description: Generative Error Correction (GER) Task Baseline, WER
---
# Post-ASR Text Correction WER Leaderboard
This application displays a baseline Word Error Rate (WER) leaderboard for the test data in the [GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction](https://huggingface.co/datasets/GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction) dataset.
## Dataset Sources
The leaderboard shows WER metrics for multiple speech recognition sources as columns:
- CHiME4
- CORAAL
- CommonVoice
- LRS2
- LibriSpeech (Clean and Other)
- SwitchBoard
- Tedlium-3
- OVERALL (aggregate across all sources)
## Baseline Methods
The leaderboard displays three baseline approaches:
1. **No LM Baseline**: Uses the 1-best ASR output without any correction (the dataset's `input1` field)
2. **N-gram Ranking**: Ranks the N-best hypotheses using a simple n-gram statistics approach and chooses the best one
3. **Subwords Voting Correction**: Uses a voting-based method to correct the transcript by combining information from all N-best hypotheses
## Metrics
The leaderboard displays as rows:
- **Number of Examples**: Count of examples in the test set for each source
- **Word Error Rate (No LM)**: WER between reference and 1-best ASR output
- **Word Error Rate (N-gram Ranking)**: WER between reference and n-gram ranked best hypothesis
- **Word Error Rate (Subwords Voting Correction)**: WER between reference and the voting-corrected N-best hypothesis
Lower WER values indicate better transcription accuracy.
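As a concrete reference, WER is the word-level edit distance (substitutions, insertions, deletions) between reference and hypothesis, divided by the number of reference words. The following is a minimal, self-contained sketch of that computation (not the code used by this Space, which may use a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```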
## Table Structure
The leaderboard is displayed as a table with:
- **Rows**: Different metrics (example counts and WER values for each method)
- **Columns**: Different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL
Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
## Technical Details
### N-gram Ranking
This method scores each hypothesis in the N-best list using:
- N-gram statistics (4-grams)
- Text length
- N-gram variety
The hypothesis with the highest score is selected.
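A rough sketch of this ranking idea is shown below. The scoring weights and the exact way the three signals are combined are illustrative assumptions, not the Space's actual implementation; here each hypothesis's 4-grams are scored by how often they recur across the whole N-best list, plus length and distinct-n-gram bonuses:

```python
from collections import Counter

def ngrams(words, n=4):
    """All contiguous n-word tuples in a word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rank_nbest(hypotheses, n=4):
    """Return the hypothesis with the highest combined n-gram score.

    Illustrative scoring: n-gram agreement with the rest of the list,
    text length, and n-gram variety (distinct n-grams).
    """
    # Pool n-gram counts over all hypotheses in the N-best list
    pool = Counter()
    for hyp in hypotheses:
        pool.update(ngrams(hyp.split(), n))

    def score(hyp):
        words = hyp.split()
        grams = ngrams(words, n)
        freq = sum(pool[g] for g in grams)  # agreement across the list
        variety = len(set(grams))           # distinct n-grams
        return freq + 0.5 * len(words) + 0.5 * variety

    return max(hypotheses, key=score)

nbest = [
    "the quick brown fox jumps over the dog",
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over the lazy dog",
]
# The middle hypothesis shares the most 4-grams with the others
print(rank_nbest(nbest))
```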
### Subwords Voting Correction
This method uses a simple voting mechanism:
- Groups hypotheses of the same length
- For each word position, chooses the most common word across all hypotheses
- Constructs a new transcript from these voted words
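The steps above can be sketched as follows. This toy version votes at the whole-word level for readability (the actual method operates on subword units) and restricts voting to hypotheses of the most common length so positions align:

```python
from collections import Counter

def voting_correction(hypotheses):
    """Position-wise majority vote over same-length hypotheses.

    Word-level toy version of the subword voting idea: keep only
    hypotheses of the most common length, then pick the most frequent
    token at each position.
    """
    tokenized = [h.split() for h in hypotheses]
    target_len = Counter(len(t) for t in tokenized).most_common(1)[0][0]
    aligned = [t for t in tokenized if len(t) == target_len]
    voted = []
    for pos in range(target_len):
        words_at_pos = Counter(t[pos] for t in aligned)
        voted.append(words_at_pos.most_common(1)[0][0])
    return " ".join(voted)

nbest = [
    "the cat sat on the mat",
    "the cat sad on the mat",
    "the bat sat on the mat",
]
# Each error appears in only one hypothesis, so voting recovers the transcript
print(voting_correction(nbest))  # -> "the cat sat on the mat"
```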
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference