---
title: Post-ASR LLM N-Best Transcription Correction
emoji: 🏢
colorFrom: yellow
colorTo: yellow
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
license: mit
short_description: Generative Error Correction (GER) Task Baseline, WER Leaderboard
---

# Post-ASR Text Correction WER Leaderboard

This application displays a baseline Word Error Rate (WER) leaderboard for the test data in the [GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction](https://huggingface.co/datasets/GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction) dataset.
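
To explore the data yourself, here is a minimal sketch using the `datasets` library. The split and field names (e.g. `input1` for the 1-best hypothesis) are assumptions taken from this README and the dataset card, not verified here:

```python
from datasets import load_dataset

# Sketch only: the split name and record schema are assumptions.
ds = load_dataset("GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction",
                  split="test")
print(ds[0])  # inspect one record to see the available fields
```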

## Dataset Sources

The leaderboard shows WER metrics for multiple speech recognition sources, one column per source:
- CHiME4
- CORAAL
- CommonVoice
- LRS2
- LibriSpeech (Clean and Other)
- SwitchBoard
- Tedlium-3
- OVERALL (aggregate across all sources)

## Baseline Methods

The leaderboard displays three baseline approaches:

1. **No LM Baseline**: Uses the 1-best ASR output without any correction (input1); a short sketch follows this list
2. **N-gram Ranking**: Ranks the N-best hypotheses using a simple n-gram statistics approach and chooses the best one
3. **Subwords Voting Correction**: Uses a voting-based method to correct the transcript by combining information from all N-best hypotheses
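
A minimal sketch of the No LM baseline, assuming the third-party `jiwer` library. The `"input1"` field name comes from this README; `"output"` as the reference field is an assumption about the dataset schema:

```python
import jiwer

# Score the raw 1-best ASR hypothesis against the reference.
# Field names are assumptions and may differ in the actual dataset.
def no_lm_wer(example: dict) -> float:
    return jiwer.wer(example["output"], example["input1"])
```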

## Metrics

The leaderboard displays the following metrics as rows:
- **Number of Examples**: Count of examples in the test set for each source
- **Word Error Rate (No LM)**: WER between reference and 1-best ASR output
- **Word Error Rate (N-gram Ranking)**: WER between reference and n-gram ranked best hypothesis
- **Word Error Rate (Subwords Voting Correction)**: WER between reference and the transcript produced by voting over the N-best hypotheses

Lower WER values indicate better transcription accuracy.
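
Concretely, WER counts the word-level substitutions (S), deletions (D), and insertions (I) needed to turn the hypothesis into the reference, divided by the number of reference words (N): `WER = (S + D + I) / N`. A quick check using the third-party `jiwer` library (one option among several; not necessarily what this Space uses internally):

```python
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick browne fox"

# One substitution over four reference words -> WER = 1/4
print(jiwer.wer(reference, hypothesis))  # 0.25
```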

## Table Structure

The leaderboard is displayed as a table with:

- **Rows**: Different metrics (example counts and WER values for each method)
- **Columns**: Different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL

Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
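
For illustration, the skeleton of such a table could be built with pandas (a sketch only; the actual app may construct it differently):

```python
import pandas as pd

sources = ["CHiME4", "CORAAL", "CommonVoice", "LRS2",
           "LibriSpeech Clean", "LibriSpeech Other",
           "SwitchBoard", "Tedlium-3", "OVERALL"]
metrics = ["Number of Examples",
           "Word Error Rate (No LM)",
           "Word Error Rate (N-gram Ranking)",
           "Word Error Rate (Subwords Voting Correction)"]

# Empty skeleton: each cell holds one metric for one source.
leaderboard = pd.DataFrame(index=metrics, columns=sources)
```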

## Technical Details

### N-gram Ranking
This method scores each hypothesis in the N-best list using:
- N-gram statistics (4-grams)
- Text length
- N-gram variety

The hypothesis with the highest score is selected.
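
A rough sketch of this idea follows. The statistics source and the weighting of the three signals are not specified here, so pooling 4-gram counts over the N-best list and combining the signals additively are assumptions:

```python
from collections import Counter

def ngrams(words, n=4):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rank_nbest(hypotheses, n=4):
    # Pool 4-gram counts across the N-best list (assumed statistics source).
    pooled = Counter(g for h in hypotheses for g in ngrams(h.split(), n))

    def score(h):
        words = h.split()
        grams = ngrams(words, n)
        freq = sum(pooled[g] for g in grams)  # n-gram statistics
        variety = len(set(grams))             # n-gram variety
        return freq + len(words) + variety    # text length; weights assumed

    return max(hypotheses, key=score)
```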

### Subwords Voting Correction
This method uses a simple voting mechanism:
- Groups hypotheses of the same length
- For each word position, chooses the most common word across all hypotheses
- Constructs a new transcript from these voted words
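
A sketch of this mechanism, where the choice of group and the tie-breaking at each position are assumptions:

```python
from collections import Counter

def voting_correction(hypotheses):
    # Group hypotheses by word count and keep the largest group,
    # so that word positions line up (group choice is an assumption).
    by_len = {}
    for words in (h.split() for h in hypotheses):
        by_len.setdefault(len(words), []).append(words)
    group = max(by_len.values(), key=len)

    # Majority vote at each word position across the group.
    voted = [Counter(col).most_common(1)[0][0] for col in zip(*group)]
    return " ".join(voted)
```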

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference