---
title: Post-ASR LLM N-Best Transcription Correction
emoji: 🏢
colorFrom: yellow
colorTo: yellow
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
license: mit
short_description: Generative Error Correction (GER) Task Baseline, WER
---
# Post-ASR Text Correction WER Leaderboard
This application displays a baseline Word Error Rate (WER) leaderboard for the test data in the [GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction](https://huggingface.co/datasets/GenSEC-LLM/SLT-Task1-Post-ASR-Text-Correction) dataset.
## Dataset Sources
The leaderboard shows WER metrics for multiple speech recognition sources as columns:
- CHiME4
- CORAAL
- CommonVoice
- LRS2
- LibriSpeech (Clean and Other)
- SwitchBoard
- Tedlium-3
- OVERALL (aggregate across all sources)
## Baseline Methods
The leaderboard displays three baseline approaches:
1. **No LM Baseline**: Uses the 1-best ASR output without any correction (the dataset's `input1` field)
2. **N-gram Ranking**: Ranks the N-best hypotheses using a simple n-gram statistics approach and chooses the best one
3. **Subwords Voting Correction**: Uses a voting-based method to correct the transcript by combining information from all N-best hypotheses
## Metrics
The leaderboard displays as rows:
- **Number of Examples**: Count of examples in the test set for each source
- **Word Error Rate (No LM)**: WER between reference and 1-best ASR output
- **Word Error Rate (N-gram Ranking)**: WER between reference and n-gram ranked best hypothesis
- **Word Error Rate (Subwords Voting Correction)**: WER between reference and the voting-corrected N-best hypothesis
Lower WER values indicate better transcription accuracy.
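As a concrete reference, WER is the word-level edit distance (substitutions, insertions, deletions) between reference and hypothesis, divided by the number of reference words. The following is a minimal, self-contained sketch of that computation (not the code used by this Space, which may use a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```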
## Table Structure
The leaderboard is displayed as a table with:
- **Rows**: Different metrics (example counts and WER values for each method)
- **Columns**: Different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL
Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
## Technical Details
### N-gram Ranking
This method scores each hypothesis in the N-best list using:
- N-gram statistics (4-grams)
- Text length
- N-gram variety
The hypothesis with the highest score is selected.
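A rough sketch of this ranking idea is shown below. The scoring weights and the exact way the three signals are combined are illustrative assumptions, not the Space's actual implementation; here each hypothesis's 4-grams are scored by how often they recur across the whole N-best list, plus length and distinct-n-gram bonuses:

```python
from collections import Counter

def ngrams(words, n=4):
    """All contiguous n-word tuples in a word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rank_nbest(hypotheses, n=4):
    """Return the hypothesis with the highest combined n-gram score.

    Illustrative scoring: n-gram agreement with the rest of the list,
    text length, and n-gram variety (distinct n-grams).
    """
    # Pool n-gram counts over all hypotheses in the N-best list
    pool = Counter()
    for hyp in hypotheses:
        pool.update(ngrams(hyp.split(), n))

    def score(hyp):
        words = hyp.split()
        grams = ngrams(words, n)
        freq = sum(pool[g] for g in grams)  # agreement across the list
        variety = len(set(grams))           # distinct n-grams
        return freq + 0.5 * len(words) + 0.5 * variety

    return max(hypotheses, key=score)

nbest = [
    "the quick brown fox jumps over the dog",
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over the lazy dog",
]
# The middle hypothesis shares the most 4-grams with the others
print(rank_nbest(nbest))
```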
### Subwords Voting Correction
This method uses a simple voting mechanism:
- Groups hypotheses of the same length
- For each word position, chooses the most common word across all hypotheses
- Constructs a new transcript from these voted words
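The steps above can be sketched as follows. This toy version votes at the whole-word level for readability (the actual method operates on subword units) and restricts voting to hypotheses of the most common length so positions align:

```python
from collections import Counter

def voting_correction(hypotheses):
    """Position-wise majority vote over same-length hypotheses.

    Word-level toy version of the subword voting idea: keep only
    hypotheses of the most common length, then pick the most frequent
    token at each position.
    """
    tokenized = [h.split() for h in hypotheses]
    target_len = Counter(len(t) for t in tokenized).most_common(1)[0][0]
    aligned = [t for t in tokenized if len(t) == target_len]
    voted = []
    for pos in range(target_len):
        words_at_pos = Counter(t[pos] for t in aligned)
        voted.append(words_at_pos.most_common(1)[0][0])
    return " ".join(voted)

nbest = [
    "the cat sat on the mat",
    "the cat sad on the mat",
    "the bat sat on the mat",
]
# Each error appears in only one hypothesis, so voting recovers the transcript
print(voting_correction(nbest))  # -> "the cat sat on the mat"
```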
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference