MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION

@@ -57,6 +57,7 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
 > For other tasks, we employ the LLM-as-a-Judge framework,
 > which uses a pre-trained large language model to evaluate task performance
 > by generating and scoring responses based on criteria such as relevance, coherence, and accuracy.
 <div class="table*">
 <table>
@@ -73,8 +74,8 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
 </thead>
 <tbody>
 <tr>
-  <td style="text-align: center;" rowspan="11"><em>Automatic Speech Recognition<br>WER (<span
-  class="math inline">↓</span>)</em></td>
   <td style="text-align: center;">LibriSpeech-Test-Clean</td>
   <td style="text-align: center;">0.03</td>
   <td style="text-align: center;">0.03</td>
@@ -163,8 +164,8 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
   <td style="text-align: center;">0.18</td>
 </tr>
 <tr>
-  <td style="text-align: center;" rowspan="6"><em>Speech Translation<br>BLEU (<span
-  class="math inline">↑</span>)</em></td>
   <td style="text-align: center;">CoVoST 2 En <span
   class="math inline">→</span> Id</td>
   <td style="text-align: center;"><strong><u>32.62</u></strong></td>
@@ -219,8 +220,8 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
   <td style="text-align: center;">2.83</td>
 </tr>
 <tr>
-  <td style="text-align: center;" rowspan="8"><em>Spoken Question Answering<br>LLM-as-a-Judge (<span
-  class="math inline">↑</span>)</em></td>
   <td style="text-align: center;">SLUE-SQA-5</td>
   <td style="text-align: center;">82.94</td>
   <td style="text-align: center;">80.05</td>
@@ -285,8 +286,8 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
   <td style="text-align: center;"><u>71.60</u></td>
 </tr>
 <tr>
-  <td style="text-align: center;" rowspan="4"><em>Spoken Dialogue Summarization<br>LLM-as-a-Judge (<span
-  class="math inline">↑</span>)</em></td>
   <td style="text-align: center;">MNSC-SDS-Part 3</td>
   <td style="text-align: center;"><u><strong>46.80</strong></u></td>
   <td style="text-align: center;">33.80</td>
@@ -319,8 +320,8 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
   <td style="text-align: center;"><u>65.40</u></td>
 </tr>
 <tr>
-  <td style="text-align: center;" rowspan="2"><em>Speech Instruction<br>LLM-as-a-Judge (<span
-  class="math inline">↑</span>)</em></td>
   <td style="text-align: center;">OpenHermes-Audio</td>
   <td style="text-align: center;"><strong>71.4</strong></td>
   <td style="text-align: center;">44.8</td>
@@ -337,8 +338,8 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
   <td style="text-align: center;"><u>73.80</u></td>
 </tr>
 <tr>
-  <td style="text-align: center;" rowspan="4"><em>Paralinguistics<br>LLM-as-a-Judge (<span
-  class="math inline">↑</span>)</em></td>
   <td style="text-align: center;">VoxCeleb-Gender-Test</td>
   <td style="text-align: center;"><strong><u>99.53</u></strong></td>
   <td style="text-align: center;">99.12</td>

 > For other tasks, we employ the LLM-as-a-Judge framework,
 > which uses a pre-trained large language model to evaluate task performance
 > by generating and scoring responses based on criteria such as relevance, coherence, and accuracy.
+> Refer to the [AudioBench paper](https://arxiv.org/abs/2406.16020) for more details.
 <div class="table*">
 <table>
 </thead>
 <tbody>
 <tr>
+  <td style="text-align: center;" rowspan="11"><strong>Automatic Speech Recognition</strong><br>WER (<span
+  class="math inline">↓</span>)</td>
   <td style="text-align: center;">LibriSpeech-Test-Clean</td>
   <td style="text-align: center;">0.03</td>
   <td style="text-align: center;">0.03</td>
   <td style="text-align: center;">0.18</td>
 </tr>
 <tr>
+  <td style="text-align: center;" rowspan="6"><strong>Speech Translation</strong><br>BLEU (<span
+  class="math inline">↑</span>)</td>
   <td style="text-align: center;">CoVoST 2 En <span
   class="math inline">→</span> Id</td>
   <td style="text-align: center;"><strong><u>32.62</u></strong></td>
   <td style="text-align: center;">2.83</td>
 </tr>
 <tr>
+  <td style="text-align: center;" rowspan="8"><strong>Spoken Question Answering</strong><br>LLM-as-a-Judge (<span
+  class="math inline">↑</span>)</td>
   <td style="text-align: center;">SLUE-SQA-5</td>
   <td style="text-align: center;">82.94</td>
   <td style="text-align: center;">80.05</td>
   <td style="text-align: center;"><u>71.60</u></td>
 </tr>
 <tr>
+  <td style="text-align: center;" rowspan="4"><strong>Spoken Dialogue Summarization</strong><br>LLM-as-a-Judge (<span
+  class="math inline">↑</span>)</td>
   <td style="text-align: center;">MNSC-SDS-Part 3</td>
   <td style="text-align: center;"><u><strong>46.80</strong></u></td>
   <td style="text-align: center;">33.80</td>
   <td style="text-align: center;"><u>65.40</u></td>
 </tr>
 <tr>
+  <td style="text-align: center;" rowspan="2"><strong>Speech Instruction</strong><br>LLM-as-a-Judge (<span
+  class="math inline">↑</span>)</td>
   <td style="text-align: center;">OpenHermes-Audio</td>
   <td style="text-align: center;"><strong>71.4</strong></td>
   <td style="text-align: center;">44.8</td>
   <td style="text-align: center;"><u>73.80</u></td>
 </tr>
 <tr>
+  <td style="text-align: center;" rowspan="4"><strong>Paralinguistics</strong><br>LLM-as-a-Judge (<span
+  class="math inline">↑</span>)</td>
   <td style="text-align: center;">VoxCeleb-Gender-Test</td>
   <td style="text-align: center;"><strong><u>99.53</u></strong></td>
   <td style="text-align: center;">99.12</td>