David Pomerenke
committed on
Commit · 0a5d23d
1 Parent(s): a65282b
Metadata and Methodology
README.md CHANGED
@@ -4,8 +4,15 @@ emoji: 🌍
 colorFrom: purple
 colorTo: pink
 sdk: gradio
-license:
+license: cc-by-sa-4.0
 short_description: Evaluating LLM performance across all human languages.
+datasets:
+- openlanguagedata/flores_plus
+models:
+- meta-llama/Llama-3.3-70B-Instruct
+- mistralai/Mistral-Small-24B-Instruct-2501
+- deepseek-ai/DeepSeek-V3
+- microsoft/phi-4
 tags:
 - leaderboard
 - submission:manual
@@ -23,6 +30,7 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 For tag meaning, see https://huggingface.co/spaces/leaderboards/LeaderboardsExplorer
 -->
 
+[](https://huggingface.co/spaces/datenlabor-bmz/ai-language-monitor)
 
 # AI Language Monitor 🌍
 
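Not part of the commit, but for context: the new `license`, `datasets`, and `models` fields are standard Space card metadata and can be read back through the Hub. A minimal sketch, assuming the Space is public and `huggingface_hub` is installed:

```python
# Sketch (not from this commit): read the Space's README front matter back
# via the Hub to confirm the new metadata fields landed as expected.
from huggingface_hub import RepoCard

card = RepoCard.load("datenlabor-bmz/ai-language-monitor", repo_type="space")
meta = card.data.to_dict()
print(meta["license"])   # cc-by-sa-4.0
print(meta["datasets"])  # ['openlanguagedata/flores_plus']
print(meta["models"])    # ['meta-llama/Llama-3.3-70B-Instruct', ...]
```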
app.py CHANGED
@@ -190,4 +190,24 @@ with gr.Blocks(title="AI Language Translation Benchmark") as demo:
     gr.DataFrame(value=df, label="Language Results", show_search="search")
     gr.Plot(value=scatter_plot, label="Language Coverage")
 
+
+    gr.Markdown("""
+    ## Methodology
+    ### Dataset
+    - Using the [FLORES+](https://huggingface.co/datasets/openlanguagedata/flores_plus) evaluation set, a high-quality human-translated benchmark covering 200 languages
+    - Each language is tested with the same 100 sentences
+    - All translations are from the evaluated language into a fixed set of representative languages, sampled by number of speakers
+    - Language statistics are sourced from Ethnologue and Wikidata
+
+    ### Models & Evaluation
+    - Models are accessed through [OpenRouter](https://openrouter.ai/), including fast models from all major labs, open and closed
+    - **BLEU Score**: Translations are evaluated using the BLEU metric, which measures how similar the model's translation is to a human reference translation; higher is better
+
+    ### Language Categories
+    Languages are divided into three tiers based on translation difficulty:
+    - High-Resource: Top 25% of languages by BLEU score (easiest to translate)
+    - Mid-Resource: Middle 50% of languages
+    - Low-Resource: Bottom 25% of languages (hardest to translate)
+    """, container=True)
+
 demo.launch()
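The methodology block added above describes the pipeline in prose; the sketches below illustrate each step in Python. First, loading the FLORES+ sentences. The per-language config name (`eng_Latn`), the `devtest` split, and the `text` column are assumptions about the dataset layout, not code from this commit; FLORES+ is gated, so an authenticated Hub token is required.

```python
# Hypothetical sketch of fetching the 100 evaluation sentences for one
# language; config name, split, and column name are assumed, not confirmed.
from datasets import load_dataset

flores = load_dataset("openlanguagedata/flores_plus", "eng_Latn", split="devtest")
sentences = flores["text"][:100]  # the same 100 sentences for every language
print(len(sentences), sentences[0])
```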
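OpenRouter exposes an OpenAI-compatible API, so a translation request can go through the standard `openai` client; the prompt wording and the lower-case model slug here are illustrative assumptions rather than the app's actual prompt:

```python
# Sketch of one translation request via OpenRouter's OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",  # OpenRouter slug (assumed)
    messages=[{
        "role": "user",
        "content": "Translate into German; reply with the translation only:\n\nHello, world!",
    }],
)
print(response.choices[0].message.content)
```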
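Scoring is corpus-level BLEU, which the standard `sacrebleu` package computes directly; the sentences here are made up for illustration:

```python
# Corpus BLEU with sacrebleu: model translations vs. human references.
# references has shape [num_reference_sets][num_sentences]; one set here.
import sacrebleu

hypotheses = ["Hallo Welt!", "Wie geht's dir?"]      # model output
references = [["Hallo, Welt!", "Wie geht es dir?"]]  # human references
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100, higher is better
```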
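Finally, the three resource tiers are quartile buckets over per-language BLEU, as described in the Language Categories section above. A sketch with pandas; the DataFrame layout and the sample scores are assumed for illustration:

```python
# Quartile-based tiers: top 25% of languages by BLEU -> High-Resource,
# middle 50% -> Mid-Resource, bottom 25% -> Low-Resource.
import pandas as pd

df = pd.DataFrame({
    "language": ["deu", "swh", "quy", "fra"],  # hypothetical results
    "bleu": [42.0, 18.5, 4.2, 45.1],
})
q25, q75 = df["bleu"].quantile([0.25, 0.75])
df["tier"] = pd.cut(
    df["bleu"],
    bins=[float("-inf"), q25, q75, float("inf")],
    labels=["Low-Resource", "Mid-Resource", "High-Resource"],
)
print(df.sort_values("bleu", ascending=False))
```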