---
title: Post-ASR LLM N-Best Transcription Correction
emoji: 🏢
colorFrom: yellow
colorTo: yellow
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
license: mit
short_description: Generative Error Correction (GER) Task Baseline, WER
---

# Post-ASR Text Correction WER Leaderboard

This application displays a baseline Word Error Rate (WER) leaderboard for the test data in the [GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction](https://huggingface.co/datasets/GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction) dataset.
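
As a quick sketch, the test data can be pulled with the Hugging Face `datasets` library; split and column names should be checked against the dataset card, and if the dataset defines multiple configurations the configuration name must be passed as a second argument:

```python
# Load the task data with the `datasets` library (pip install datasets).
# Split and column names are assumptions to verify on the dataset card.
from datasets import load_dataset

ds = load_dataset("GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction")
print(ds)  # inspect the available splits and columns before use
```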

## Dataset Sources

The leaderboard shows WER metrics for multiple speech recognition sources, one column per source:

- CHiME4
- CORAAL
- CommonVoice
- LRS2
- LibriSpeech (Clean and Other)
- SwitchBoard
- Tedlium-3
- OVERALL (aggregate across all sources)

## Baseline Methods

The leaderboard displays three baseline approaches:

1. No LM Baseline: uses the 1-best ASR output (the `input1` field) without any correction; a minimal sketch follows this list
2. N-gram Ranking: scores each of the N-best hypotheses with simple n-gram statistics and selects the highest-scoring one (see Technical Details below)
3. Subwords Voting Correction: combines information from all N-best hypotheses via a voting-based method to produce a corrected transcript (see Technical Details below)
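
A minimal sketch of the first baseline, assuming each test example is a dict-like dataset row whose 1-best hypothesis sits in the `input1` field mentioned above:

```python
# No-LM baseline: keep the 1-best ASR output untouched.
def no_lm_baseline(example):
    return example["input1"]  # 1-best hypothesis, used as-is
```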

## Metrics

The leaderboard displays the following metrics as rows:

- Number of Examples: count of test-set examples for each source
- Word Error Rate (No LM): WER between the reference and the 1-best ASR output
- Word Error Rate (N-gram Ranking): WER between the reference and the top-ranked hypothesis
- Word Error Rate (Subwords Voting Correction): WER between the reference and the voting-corrected transcript

Lower WER values indicate better transcription accuracy.
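
WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The `jiwer` package (a common choice, though not necessarily what this app uses internally) computes it directly:

```python
# Word Error Rate with the `jiwer` package (pip install jiwer).
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box"
print(jiwer.wer(reference, hypothesis))  # 0.25: 1 substitution / 4 ref words
```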

## Table Structure

The leaderboard is displayed as a table with:

- Rows: different metrics (example counts and WER values for each method)
- Columns: different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL

Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
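
A hypothetical pandas sketch of that layout; all numbers are zero placeholders, not results from the dataset:

```python
# Leaderboard layout sketch: metric rows, one column per source plus OVERALL.
# The values below are placeholders only, not actual leaderboard numbers.
import pandas as pd

sources = ["CHiME4", "CORAAL", "CommonVoice", "OVERALL"]
metrics = {
    "Number of Examples": [0, 0, 0, 0],
    "WER (No LM)": [0.0, 0.0, 0.0, 0.0],
    "WER (N-gram Ranking)": [0.0, 0.0, 0.0, 0.0],
}
table = pd.DataFrame(metrics, index=sources).T  # transpose: metrics as rows
print(table)
```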

## Technical Details

### N-gram Ranking

This method scores each hypothesis in the N-best list using:

- N-gram statistics (4-grams)
- Text length
- N-gram variety

The hypothesis with the highest score is selected.
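
One plausible reading of this scoring is sketched below; the exact combination of the three signals is an assumption, and here the 4-grams are pooled across the whole N-best list so hypotheses that agree with the others score higher:

```python
# N-gram ranking sketch: score each hypothesis by pooled 4-gram support,
# 4-gram variety, and text length; the weights are illustrative assumptions.
from collections import Counter

def four_grams(text):
    words = text.split()
    return [tuple(words[i:i + 4]) for i in range(len(words) - 3)]

def ngram_rank(nbest, variety_weight=1.0, length_weight=0.1):
    # Pool 4-gram counts over the whole N-best list: a hypothesis whose
    # 4-grams also occur in other hypotheses receives more support.
    pool = Counter(g for hyp in nbest for g in four_grams(hyp))

    def score(hyp):
        grams = four_grams(hyp)
        support = sum(pool[g] for g in grams)  # 4-gram statistics
        variety = len(set(grams))              # n-gram variety
        length = len(hyp.split())              # text length
        return support + variety_weight * variety + length_weight * length

    return max(nbest, key=score)  # hypothesis with the highest score wins
```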

### Subwords Voting Correction

This method uses a simple voting mechanism, sketched below:

- Groups hypotheses of the same length
- For each word position, chooses the most common word across the grouped hypotheses
- Constructs a new transcript from these voted words
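
A minimal sketch of this voting, assuming the vote happens inside the most common length group and ties are broken arbitrarily (neither detail is specified above):

```python
# Subwords-voting sketch: vote word-by-word inside the largest group of
# equal-length hypotheses; group choice and tie-breaking are assumptions.
from collections import Counter

def voting_correction(nbest):
    tokenized = [hyp.split() for hyp in nbest]
    # group hypotheses by word count and keep the most common length
    target_len, _ = Counter(len(t) for t in tokenized).most_common(1)[0]
    group = [t for t in tokenized if len(t) == target_len]
    voted = [Counter(column).most_common(1)[0][0]  # majority word per slot
             for column in zip(*group)]
    return " ".join(voted)
```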

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference