huckiyang committed
Commit d7d6438 · 1 Parent(s): 381227f

more LM baseline

Files changed (2)
  1. README.md +31 -11
  2. app.py +207 -56
README.md CHANGED
@@ -27,27 +27,47 @@ The leaderboard shows WER metrics for multiple speech recognition sources as col
  - Tedlium-3
  - OVERALL (aggregate across all sources)

- ## Metrics

- The leaderboard displays as rows:
- - **Count**: Number of examples in the test set for each source
- - **No LM Baseline**: Word Error Rate between the reference transcription and 1-best ASR output without language model correction

- ## Baseline Calculation

- Word Error Rate is calculated between:
- - Reference transcription ("transcription" field)
- - 1-best ASR output ("input1" field or first item from "hypothesis" when input1 is unavailable)

- Lower WER values indicate better transcription accuracy.

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

  ## Table Structure

  The leaderboard is displayed as a table with:

- - **Rows**: "Number of Examples" and "Word Error Rate (WER)"
  - **Columns**: Different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL

  Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
  - Tedlium-3
  - OVERALL (aggregate across all sources)

+ ## Baseline Methods

+ The leaderboard displays three baseline approaches:

+ 1. **No LM Baseline**: Uses the 1-best ASR output ("input1") without any correction
+ 2. **N-best LM Ranking**: Ranks the N-best hypotheses with a simple language-model score and keeps the best one
+ 3. **N-best Correction**: Combines information from all N-best hypotheses with a voting-based method to produce a corrected transcript

+ ## Metrics

+ The leaderboard displays the following rows:
+ - **Number of Examples**: Count of examples in the test set for each source
+ - **Word Error Rate (No LM)**: WER between the reference and the 1-best ASR output
+ - **Word Error Rate (N-best LM Ranking)**: WER between the reference and the LM-ranked best hypothesis
+ - **Word Error Rate (N-best Correction)**: WER between the reference and the corrected N-best hypothesis

+ Lower WER values indicate better transcription accuracy.
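For concreteness, WER here is the word-level edit (Levenshtein) distance between reference and hypothesis divided by the number of reference words. A minimal, self-contained sketch of the metric (illustrative only; the Space's own helper is `calculate_simple_wer` in app.py, whose body is not shown in this diff):

```python
def toy_wer(reference: str, hypothesis: str) -> float:
    """Word-level WER: edit distance between word sequences / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))              # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") against 6 reference words:
print(toy_wer("the cat sat on the mat", "the cat sit on mat"))  # ~0.3333
```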

  ## Table Structure

  The leaderboard is displayed as a table with:

+ - **Rows**: Different metrics (example counts and WER values for each method)
  - **Columns**: Different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL

  Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
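To make that layout concrete, a small sketch of the transposed frame (illustrative only; the source names are examples from this README, all values are left as NaN placeholders, and the real frame is built in `get_wer_metrics` in app.py):

```python
import numpy as np
import pandas as pd

# Rows are metrics, columns are data sources; values are NaN placeholders.
metrics = ["Count", "No LM Baseline", "N-best LM Ranking", "N-best Correction"]
sources = ["CHiME4", "CORAAL", "CommonVoice", "OVERALL"]  # illustrative subset
layout = pd.DataFrame(np.nan, index=metrics, columns=["Metric"] + sources)
layout["Metric"] = [
    "Number of Examples",
    "Word Error Rate (No LM)",
    "Word Error Rate (N-best LM Ranking)",
    "Word Error Rate (N-best Correction)",
]
print(layout)
```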
+
+ ## Technical Details
+
+ ### N-best LM Ranking
+ This method scores each hypothesis in the N-best list using:
+ - N-gram statistics (4-grams by default)
+ - Text length
+ - N-gram variety
+
+ The hypothesis with the highest score is selected.
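A toy, self-contained sketch of this ranking idea (length plus a 4-gram-variety bonus); it mirrors the scoring described above but is not the committed `score_hypothesis`, which lives in app.py:

```python
def toy_score(hypothesis: str, n: int = 4) -> float:
    """Length plus a small bonus for n-gram variety (illustrative scorer)."""
    words = hypothesis.split()
    if len(words) < n:
        return float(len(words))       # very short texts: just the word count
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(words) + len(set(ngrams)) / len(ngrams) * 5

nbest = [
    "the cat sat on the mat today",    # varied 4-grams     -> score 12.0
    "the cat the cat the cat today",   # repetitive 4-grams -> score 10.75
]
print(max(nbest, key=toy_score))       # -> "the cat sat on the mat today"
```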
+
+ ### N-best Correction
+ This method uses a simple voting mechanism (see the sketch after this list):
+ - Keeps only hypotheses whose word count is the most common length
+ - For each word position, chooses the most common word across the kept hypotheses
+ - Constructs a new transcript from these voted words
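A toy, self-contained sketch of that voting step (not the committed `correct_hypotheses` in app.py, which also handles non-string entries and a fallback when filtering removes everything):

```python
from collections import Counter

def toy_vote(hypotheses: list[str]) -> str:
    """Keep hypotheses of the most common word count, then majority-vote each position."""
    word_lists = [h.split() for h in hypotheses]
    common_len = Counter(len(w) for w in word_lists).most_common(1)[0][0]
    kept = [w for w in word_lists if len(w) == common_len]
    return " ".join(
        Counter(words[i] for words in kept).most_common(1)[0][0]
        for i in range(common_len)
    )

nbest = [
    "the cat sat on the mat",
    "the cat sit on the mat",
    "a cat sat on a mat",
]
print(toy_vote(nbest))  # -> "the cat sat on the mat"
```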
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py CHANGED
@@ -6,6 +6,8 @@ import numpy as np
  from functools import lru_cache
  import traceback
  import re

  # Cache the dataset loading to avoid reloading on refresh
  @lru_cache(maxsize=1)
@@ -37,6 +39,100 @@ def preprocess_text(text):
  text = re.sub(r'\s+', ' ', text).strip()
  return text

  # Fix the Levenshtein distance calculation to avoid dependence on jiwer internals
  def calculate_simple_wer(reference, hypothesis):
  """Calculate WER using a simple word-based approach"""
@@ -67,10 +163,10 @@ def calculate_simple_wer(reference, hypothesis):
  return 1.0
  return float(distance) / float(len(ref_words))

- # Calculate WER for a group of examples
- def calculate_wer(examples):
  if not examples:
- return 0.0

  try:
  # Check if examples is a Dataset or a list
@@ -83,7 +179,7 @@ def calculate_wer(examples):
  example = examples[0]
  else:
  print("No examples found")
- return np.nan

  print("\n===== EXAMPLE DATA INSPECTION =====")
  print(f"Keys in example: {example.keys()}")
@@ -101,7 +197,10 @@ def calculate_wer(examples):
  print(f"Hypothesis field '{field}' found with value: {str(example[field])[:100]}...")

  # Process each example in the dataset
- wer_values = []
  valid_count = 0
  skipped_count = 0

@@ -115,10 +214,19 @@ def calculate_wer(examples):

  for i, ex in enumerate(items_to_process):
  try:
- # Try to get transcription and input1
  transcription = ex.get("transcription")

- # First try input1, then use first element from hypothesis if available
  input1 = ex.get("input1")
  if input1 is None and "hypothesis" in ex and ex["hypothesis"]:
  if isinstance(ex["hypothesis"], list) and len(ex["hypothesis"]) > 0:
@@ -126,58 +234,89 @@ def calculate_wer(examples):
  elif isinstance(ex["hypothesis"], str):
  input1 = ex["hypothesis"]

- # Print debug info for a few examples
- if i < 3:
- print(f"\nExample {i} inspection:")
- print(f" transcription: {transcription}")
- print(f" input1: {input1}")
- print(f" type checks: transcription={type(transcription)}, input1={type(input1)}")

- # Skip if either field is missing
- if transcription is None or input1 is None:
- skipped_count += 1
- if i < 3:
- print(f" SKIPPED: Missing field (transcription={transcription is None}, input1={input1 is None})")
- continue

- # Skip if either field is empty after preprocessing
- reference = preprocess_text(transcription)
- hypothesis = preprocess_text(input1)

- if not reference or not hypothesis:
- skipped_count += 1
- if i < 3:
- print(f" SKIPPED: Empty after preprocessing (reference='{reference}', hypothesis='{hypothesis}')")
- continue

- # Calculate WER for this pair
- pair_wer = calculate_simple_wer(reference, hypothesis)
- wer_values.append(pair_wer)
- valid_count += 1

- if i < 3:
- print(f" VALID PAIR: reference='{reference}', hypothesis='{hypothesis}', WER={pair_wer:.4f}")

  except Exception as ex_error:
  print(f"Error processing example {i}: {str(ex_error)}")
  skipped_count += 1
  continue

- # Calculate average WER
  print(f"\nProcessing summary: Valid pairs: {valid_count}, Skipped: {skipped_count}")

- if not wer_values:
- print("No valid pairs found for WER calculation")
- return np.nan

- avg_wer = np.mean(wer_values)
- print(f"Calculated {len(wer_values)} pairs with average WER: {avg_wer:.4f}")
- return avg_wer

  except Exception as e:
  print(f"Error in calculate_wer: {str(e)}")
  print(traceback.format_exc())
- return np.nan

  # Get WER metrics by source
  def get_wer_metrics(dataset):
@@ -218,19 +357,23 @@ def get_wer_metrics(dataset):

  if count > 0:
  print(f"\nCalculating WER for source {source} with {count} examples")
- wer = calculate_wer(examples) # Now handles both lists and datasets
  else:
- wer = np.nan

  source_results[source] = {
  "Count": count,
- "No LM Baseline": wer
  }
  except Exception as e:
  print(f"Error processing source {source}: {str(e)}")
  source_results[source] = {
  "Count": 0,
- "No LM Baseline": np.nan
  }

  # Calculate overall metrics with a sample but excluding all_et05_real
@@ -243,26 +386,35 @@ def get_wer_metrics(dataset):
  # Sample for calculation
  sample_size = min(500, total_count)
  sample_dataset = filtered_dataset[:sample_size]
- overall_wer = calculate_wer(sample_dataset)

  source_results["OVERALL"] = {
  "Count": total_count,
- "No LM Baseline": overall_wer
  }
  except Exception as e:
  print(f"Error calculating overall metrics: {str(e)}")
  print(traceback.format_exc())
  source_results["OVERALL"] = {
  "Count": len(filtered_dataset),
- "No LM Baseline": np.nan
  }

  # Create a transposed DataFrame with metrics as rows and sources as columns
- metrics = ["Count", "No LM Baseline"]
  result_df = pd.DataFrame(index=metrics, columns=["Metric"] + all_sources + ["OVERALL"])

  # Add descriptive column
- result_df["Metric"] = ["Number of Examples", "Word Error Rate (WER)"]

  for source in all_sources + ["OVERALL"]:
  for metric in metrics:
@@ -284,14 +436,13 @@ def format_dataframe(df):
  # Use vectorized operations instead of apply
  df = df.copy()

- # Find the row containing WER values (now with new index name)
- wer_row_index = None
  for idx in df.index:
  if "WER" in idx or "Error Rate" in idx:
- wer_row_index = idx
- break

- if wer_row_index:
  # Convert to object type first to avoid warnings
  df.loc[wer_row_index] = df.loc[wer_row_index].astype(object)

@@ -323,7 +474,7 @@ def create_leaderboard():
  # Create the Gradio interface
  with gr.Blocks(title="ASR Text Correction Test Leaderboard") as demo:
  gr.Markdown("# ASR Text Correction Baseline WER Leaderboard (Test Data)")
- gr.Markdown("Word Error Rate (WER) metrics for different speech sources with No Language Model baseline")

  with gr.Row():
  refresh_btn = gr.Button("Refresh Leaderboard")
 
  from functools import lru_cache
  import traceback
  import re
+ import string
+ from collections import Counter

  # Cache the dataset loading to avoid reloading on refresh
  @lru_cache(maxsize=1)
 
  text = re.sub(r'\s+', ' ', text).strip()
  return text

+ # Simple language model scoring - count n-grams
+ def score_hypothesis(hypothesis, n=4):
+ """Score a hypothesis using simple n-gram statistics"""
+ if not hypothesis:
+ return 0
+
+ words = hypothesis.split()
+ if len(words) < n:
+ return len(words) # Just return word count for very short texts
+
+ # Count n-grams
+ ngrams = []
+ for i in range(len(words) - n + 1):
+ ngram = ' '.join(words[i:i+n])
+ ngrams.append(ngram)
+
+ # More unique n-grams might indicate better fluency
+ unique_ngrams = len(set(ngrams))
+ total_ngrams = len(ngrams)
+
+ # Score is a combination of length and n-gram variety
+ score = len(words) + unique_ngrams/max(1, total_ngrams) * 5
+ return score
+
+ # N-best LM ranking approach
+ def get_best_hypothesis_lm(hypotheses):
+ """Choose the best hypothesis using a simple language model approach"""
+ if not hypotheses:
+ return ""
+
+ # Convert to list if it's not already
+ if isinstance(hypotheses, str):
+ return hypotheses
+
+ # Ensure we have a list of strings
+ hypothesis_list = []
+ for h in hypotheses:
+ if isinstance(h, str):
+ hypothesis_list.append(preprocess_text(h))
+
+ if not hypothesis_list:
+ return ""
+
+ # Score each hypothesis and choose the best one
+ scores = [(score_hypothesis(h), h) for h in hypothesis_list]
+ best_hypothesis = max(scores, key=lambda x: x[0])[1]
+ return best_hypothesis
+
+ # N-best correction approach
+ def correct_hypotheses(hypotheses):
+ """Simple n-best correction by voting on words"""
+ if not hypotheses:
+ return ""
+
+ # Convert to list if it's not already
+ if isinstance(hypotheses, str):
+ return hypotheses
+
+ # Ensure we have a list of strings
+ hypothesis_list = []
+ for h in hypotheses:
+ if isinstance(h, str):
+ hypothesis_list.append(preprocess_text(h))
+
+ if not hypothesis_list:
+ return ""
+
+ # Split hypotheses into words
+ word_lists = [h.split() for h in hypothesis_list]
+
+ # Find the most common length
+ lengths = [len(words) for words in word_lists]
+ if not lengths:
+ return ""
+
+ most_common_length = Counter(lengths).most_common(1)[0][0]
+
+ # Only consider hypotheses with the most common length
+ filtered_word_lists = [words for words in word_lists if len(words) == most_common_length]
+
+ if not filtered_word_lists:
+ # Fall back to the longest hypothesis if filtering removed everything
+ return max(hypothesis_list, key=len)
+
+ # Vote on each word position
+ corrected_words = []
+ for i in range(most_common_length):
+ position_words = [words[i] for words in filtered_word_lists]
+ most_common_word = Counter(position_words).most_common(1)[0][0]
+ corrected_words.append(most_common_word)
+
+ # Join the corrected words
+ return ' '.join(corrected_words)
+
  # Fix the Levenshtein distance calculation to avoid dependence on jiwer internals
  def calculate_simple_wer(reference, hypothesis):
  """Calculate WER using a simple word-based approach"""
 
  return 1.0
  return float(distance) / float(len(ref_words))

+ # Calculate WER for a group of examples with multiple methods
+ def calculate_wer_methods(examples):
  if not examples:
+ return 0.0, 0.0, 0.0

  try:
  # Check if examples is a Dataset or a list
 
  example = examples[0]
  else:
  print("No examples found")
+ return np.nan, np.nan, np.nan

  print("\n===== EXAMPLE DATA INSPECTION =====")
  print(f"Keys in example: {example.keys()}")
 
  print(f"Hypothesis field '{field}' found with value: {str(example[field])[:100]}...")

  # Process each example in the dataset
+ wer_values_no_lm = []
+ wer_values_lm_ranking = []
+ wer_values_n_best_correction = []
+
  valid_count = 0
  skipped_count = 0

  for i, ex in enumerate(items_to_process):
  try:
+ # Get reference transcription
  transcription = ex.get("transcription")
+ if not transcription or not isinstance(transcription, str):
+ skipped_count += 1
+ continue
+
+ # Process the reference
+ reference = preprocess_text(transcription)
+ if not reference:
+ skipped_count += 1
+ continue

+ # Get 1-best hypothesis for baseline
  input1 = ex.get("input1")
  if input1 is None and "hypothesis" in ex and ex["hypothesis"]:
  if isinstance(ex["hypothesis"], list) and len(ex["hypothesis"]) > 0:
 
  elif isinstance(ex["hypothesis"], str):
  input1 = ex["hypothesis"]

+ # Get n-best hypotheses for other methods
+ n_best_hypotheses = ex.get("hypothesis", [])

+ # Process and evaluate all methods

+ # Method 1: No LM (1-best ASR output)
+ if input1 and isinstance(input1, str):
+ no_lm_hyp = preprocess_text(input1)
+ if no_lm_hyp:
+ wer_no_lm = calculate_simple_wer(reference, no_lm_hyp)
+ wer_values_no_lm.append(wer_no_lm)

+ # Method 2: LM ranking (best of n-best)
+ if n_best_hypotheses:
+ lm_best_hyp = get_best_hypothesis_lm(n_best_hypotheses)
+ if lm_best_hyp:
+ wer_lm = calculate_simple_wer(reference, lm_best_hyp)
+ wer_values_lm_ranking.append(wer_lm)
+
+ # Method 3: N-best correction (voting among n-best)
+ if n_best_hypotheses:
+ corrected_hyp = correct_hypotheses(n_best_hypotheses)
+ if corrected_hyp:
+ wer_corrected = calculate_simple_wer(reference, corrected_hyp)
+ wer_values_n_best_correction.append(wer_corrected)

+ # Count as valid if at least one method worked
+ if (wer_values_no_lm and i == len(wer_values_no_lm) - 1) or \
+ (wer_values_lm_ranking and i == len(wer_values_lm_ranking) - 1) or \
+ (wer_values_n_best_correction and i == len(wer_values_n_best_correction) - 1):
+ valid_count += 1
+ else:
+ skipped_count += 1

+ # Print debug info for a few examples
+ if i < 2:
+ print(f"\nExample {i} inspection:")
+ print(f" Reference: '{reference}'")
+
+ if input1 and isinstance(input1, str):
+ no_lm_hyp = preprocess_text(input1)
+ print(f" No LM (1-best): '{no_lm_hyp}'")
+ if no_lm_hyp:
+ wer = calculate_simple_wer(reference, no_lm_hyp)
+ print(f" No LM WER: {wer:.4f}")
+
+ if n_best_hypotheses:
+ print(f" N-best count: {len(n_best_hypotheses) if isinstance(n_best_hypotheses, list) else 'not a list'}")
+ lm_best_hyp = get_best_hypothesis_lm(n_best_hypotheses)
+ print(f" LM ranking best: '{lm_best_hyp}'")
+ if lm_best_hyp:
+ wer = calculate_simple_wer(reference, lm_best_hyp)
+ print(f" LM ranking WER: {wer:.4f}")
+
+ corrected_hyp = correct_hypotheses(n_best_hypotheses)
+ print(f" N-best correction: '{corrected_hyp}'")
+ if corrected_hyp:
+ wer = calculate_simple_wer(reference, corrected_hyp)
+ print(f" N-best correction WER: {wer:.4f}")

  except Exception as ex_error:
  print(f"Error processing example {i}: {str(ex_error)}")
  skipped_count += 1
  continue

+ # Calculate average WER for each method
  print(f"\nProcessing summary: Valid pairs: {valid_count}, Skipped: {skipped_count}")

+ no_lm_wer = np.mean(wer_values_no_lm) if wer_values_no_lm else np.nan
+ lm_ranking_wer = np.mean(wer_values_lm_ranking) if wer_values_lm_ranking else np.nan
+ n_best_correction_wer = np.mean(wer_values_n_best_correction) if wer_values_n_best_correction else np.nan
+
+ print(f"Calculated WERs:")
+ print(f" No LM: {len(wer_values_no_lm)} pairs, avg WER: {no_lm_wer:.4f}")
+ print(f" LM Ranking: {len(wer_values_lm_ranking)} pairs, avg WER: {lm_ranking_wer:.4f}")
+ print(f" N-best Correction: {len(wer_values_n_best_correction)} pairs, avg WER: {n_best_correction_wer:.4f}")

+ return no_lm_wer, lm_ranking_wer, n_best_correction_wer

  except Exception as e:
  print(f"Error in calculate_wer: {str(e)}")
  print(traceback.format_exc())
+ return np.nan, np.nan, np.nan

  # Get WER metrics by source
  def get_wer_metrics(dataset):
 
  if count > 0:
  print(f"\nCalculating WER for source {source} with {count} examples")
+ no_lm_wer, lm_ranking_wer, n_best_wer = calculate_wer_methods(examples)
  else:
+ no_lm_wer, lm_ranking_wer, n_best_wer = np.nan, np.nan, np.nan

  source_results[source] = {
  "Count": count,
+ "No LM Baseline": no_lm_wer,
+ "N-best LM Ranking": lm_ranking_wer,
+ "N-best Correction": n_best_wer
  }
  except Exception as e:
  print(f"Error processing source {source}: {str(e)}")
  source_results[source] = {
  "Count": 0,
+ "No LM Baseline": np.nan,
+ "N-best LM Ranking": np.nan,
+ "N-best Correction": np.nan
  }

  # Calculate overall metrics with a sample but excluding all_et05_real
 
  # Sample for calculation
  sample_size = min(500, total_count)
  sample_dataset = filtered_dataset[:sample_size]
+ no_lm_wer, lm_ranking_wer, n_best_wer = calculate_wer_methods(sample_dataset)

  source_results["OVERALL"] = {
  "Count": total_count,
+ "No LM Baseline": no_lm_wer,
+ "N-best LM Ranking": lm_ranking_wer,
+ "N-best Correction": n_best_wer
  }
  except Exception as e:
  print(f"Error calculating overall metrics: {str(e)}")
  print(traceback.format_exc())
  source_results["OVERALL"] = {
  "Count": len(filtered_dataset),
+ "No LM Baseline": np.nan,
+ "N-best LM Ranking": np.nan,
+ "N-best Correction": np.nan
  }

  # Create a transposed DataFrame with metrics as rows and sources as columns
+ metrics = ["Count", "No LM Baseline", "N-best LM Ranking", "N-best Correction"]
  result_df = pd.DataFrame(index=metrics, columns=["Metric"] + all_sources + ["OVERALL"])

  # Add descriptive column
+ result_df["Metric"] = [
+ "Number of Examples",
+ "Word Error Rate (No LM)",
+ "Word Error Rate (N-best LM Ranking)",
+ "Word Error Rate (N-best Correction)"
+ ]

  for source in all_sources + ["OVERALL"]:
  for metric in metrics:
 
  # Use vectorized operations instead of apply
  df = df.copy()

+ # Find the rows containing WER values
+ wer_row_indices = []
  for idx in df.index:
  if "WER" in idx or "Error Rate" in idx:
+ wer_row_indices.append(idx)

+ for wer_row_index in wer_row_indices:
  # Convert to object type first to avoid warnings
  df.loc[wer_row_index] = df.loc[wer_row_index].astype(object)

  # Create the Gradio interface
  with gr.Blocks(title="ASR Text Correction Test Leaderboard") as demo:
  gr.Markdown("# ASR Text Correction Baseline WER Leaderboard (Test Data)")
+ gr.Markdown("Word Error Rate (WER) metrics for different speech sources with multiple correction approaches")

  with gr.Row():
  refresh_btn = gr.Button("Refresh Leaderboard")