---
title: Post-ASR LLM N-Best Transcription Correction
emoji: 🏢
colorFrom: yellow
colorTo: yellow
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
license: mit
short_description: Generative Error Correction (GER) Task Baseline, WER
---

# Post-ASR Text Correction WER Leaderboard

This application displays a baseline Word Error Rate (WER) leaderboard for the test data in the [GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction](https://huggingface.co/datasets/GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction) dataset.
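
A minimal sketch of loading the dataset with the `datasets` library is shown below. The split name and every field name other than `input1` are assumptions; check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load the task data from the Hub. The split name here is an illustrative
# assumption; consult the dataset card for the real configuration.
ds = load_dataset("GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction", split="test")

example = ds[0]
print(example["input1"])  # 1-best ASR hypothesis (field named in this README)
print(example["output"])  # reference transcript (field name is an assumption)
```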

## Dataset Sources

The leaderboard reports WER metrics for multiple speech recognition data sources, displayed as columns:

- CHiME4
- CORAAL
- CommonVoice
- LRS2
- LibriSpeech (Clean and Other)
- SwitchBoard
- Tedlium-3
- OVERALL (aggregate across all sources)

## Baseline Methods

The leaderboard displays three baseline approaches:

1. **No LM Baseline**: Uses the 1-best ASR output (the `input1` field) without any correction; a minimal extraction sketch follows this list
2. **N-gram Ranking**: Scores the N-best hypotheses with a simple n-gram statistics approach and selects the highest-scoring one
3. **Subwords Voting Correction**: Uses a voting-based method to correct the transcript by combining information from all N-best hypotheses
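
The first baseline needs no model at all: it passes the 1-best hypothesis through unchanged. A minimal sketch, reusing the `ds` object from the loading example above (the `output` field name is an assumption):

```python
# Baseline 1 (No LM): keep the 1-best ASR output exactly as produced.
# `ds` is the dataset loaded in the sketch above; "output" is an assumed field name.
hypotheses = [ex["input1"] for ex in ds]   # uncorrected 1-best transcripts
references = [ex["output"] for ex in ds]   # ground-truth transcripts
```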

## Metrics

The leaderboard displays the following metrics as rows:

- **Number of Examples**: Count of examples in the test set for each source
- **Word Error Rate (No LM)**: WER between the reference and the 1-best ASR output
- **Word Error Rate (N-gram Ranking)**: WER between the reference and the hypothesis selected by n-gram ranking
- **Word Error Rate (Subwords Voting Correction)**: WER between the reference and the transcript produced by voting over the N-best hypotheses

Lower WER values indicate better transcription accuracy.
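
WER counts word-level substitutions, deletions, and insertions against the reference and divides by the number of reference words. A minimal sketch using the `jiwer` package (the library choice is an assumption; this Space may compute WER differently):

```python
import jiwer

references = ["the cat sat on the mat"]
hypotheses = ["the cat sad on the mat"]

# Corpus-level WER: (substitutions + deletions + insertions) / reference words.
# One substitution over six reference words gives roughly 0.167; lower is better.
print(jiwer.wer(references, hypotheses))
```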

## Table Structure

The leaderboard is displayed as a table with:

- **Rows**: Different metrics (example counts and WER values for each method)
- **Columns**: Different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL

Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
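
One way such a table might be assembled, sketched with pandas; the row and column labels mirror this README, and the empty cells are placeholders rather than real leaderboard numbers:

```python
import pandas as pd

sources = ["CHiME4", "CORAAL", "CommonVoice", "LRS2", "LibriSpeech Clean",
           "LibriSpeech Other", "SwitchBoard", "Tedlium-3", "OVERALL"]
metrics = ["Number of Examples",
           "Word Error Rate (No LM)",
           "Word Error Rate (N-gram Ranking)",
           "Word Error Rate (Subwords Voting Correction)"]

# Rows are metrics, columns are data sources; each cell would be filled
# from the evaluation results for that source.
table = pd.DataFrame(index=metrics, columns=sources)
print(table)
```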

## Technical Details

### N-gram Ranking

This method scores each hypothesis in the N-best list using:

- N-gram statistics (4-grams)
- Text length
- N-gram variety

The hypothesis with the highest score is selected.
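
A minimal sketch of one way these three signals could be combined; the whitespace tokenization, within-list 4-gram counts, and equal weighting are assumptions rather than the leaderboard's exact scoring formula:

```python
from collections import Counter

def ngrams(tokens, n=4):
    # All contiguous n-grams of a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rank_nbest(hypotheses, n=4):
    # Build 4-gram counts over the whole N-best list, then score each
    # hypothesis by n-gram frequency, text length, and n-gram variety.
    tokenized = [h.split() for h in hypotheses]
    counts = Counter(g for toks in tokenized for g in ngrams(toks, n))

    def score(tokens):
        grams = ngrams(tokens, n)
        return sum(counts[g] for g in grams) + len(tokens) + len(set(grams))

    return max(hypotheses, key=lambda h: score(h.split()))

print(rank_nbest([
    "the cat sat on the mat today",
    "the cat sad on the mat today",
    "the cat sat on a mat today",
]))  # -> "the cat sat on the mat today"
```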

### Subwords Voting Correction

This method uses a simple voting mechanism (see the sketch after this list):

- Groups hypotheses of the same length
- For each word position, chooses the most common word across all hypotheses
- Constructs a new transcript from these voted words
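
A minimal sketch of that mechanism, assuming whitespace tokenization and majority voting within the most common hypothesis length; ties are broken arbitrarily:

```python
from collections import Counter

def voting_correction(hypotheses):
    # Group hypotheses by word count and vote within the most common length.
    tokenized = [h.split() for h in hypotheses]
    target_len = Counter(len(t) for t in tokenized).most_common(1)[0][0]
    group = [t for t in tokenized if len(t) == target_len]

    # For each word position, keep the most frequent word across the group.
    corrected = [Counter(t[i] for t in group).most_common(1)[0][0]
                 for i in range(target_len)]
    return " ".join(corrected)

print(voting_correction([
    "the cat sat on the mat",
    "the cat sad on the mat",
    "a cat sat on the mat",
]))  # -> "the cat sat on the mat"
```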

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|