David Pomerenke committed
Commit 0a5d23d · 1 Parent(s): a65282b

Metadata and Methodology

Files changed (2):
  1. README.md (+9 −1)
  2. app.py (+20 −0)
README.md CHANGED
@@ -4,8 +4,15 @@ emoji: 🌍
  colorFrom: purple
  colorTo: pink
  sdk: gradio
- license: mit
+ license: cc-by-sa-4.0
  short_description: Evaluating LLM performance across all human languages.
+ datasets:
+ - openlanguagedata/flores_plus
+ models:
+ - meta-llama/Llama-3.3-70B-Instruct
+ - mistralai/Mistral-Small-24B-Instruct-2501
+ - deepseek-ai/DeepSeek-V3
+ - microsoft/phi-4
  tags:
  - leaderboard
  - submission:manual
@@ -23,6 +30,7 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
  For tag meaning, see https://huggingface.co/spaces/leaderboards/LeaderboardsExplorer
  -->

+ [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-purple)](https://huggingface.co/spaces/datenlabor-bmz/ai-language-monitor)

  # AI Language Monitor 🌍

app.py CHANGED
@@ -190,4 +190,24 @@ with gr.Blocks(title="AI Language Translation Benchmark") as demo:
      gr.DataFrame(value=df, label="Language Results", show_search="search")
      gr.Plot(value=scatter_plot, label="Language Coverage")

+
+     gr.Markdown("""
+     ## Methodology
+     ### Dataset
+     - Uses the [FLORES+](https://huggingface.co/datasets/openlanguagedata/flores_plus) evaluation set, a high-quality, human-translated benchmark covering 200 languages
+     - Each language is tested on the same 100 sentences
+     - All translations go from the evaluated language into a fixed set of representative target languages, sampled by number of speakers
+     - Language statistics are sourced from Ethnologue and Wikidata
+
+     ### Models & Evaluation
+     - Models are accessed through [OpenRouter](https://openrouter.ai/), including fast open and closed models from all major labs
+     - **BLEU score**: Translations are evaluated with the BLEU metric, which measures how similar the model's translation is to a human reference translation (higher is better)
+
+     ### Language Categories
+     Languages are divided into three tiers by translation difficulty:
+     - High-resource: top 25% of languages by BLEU score (easiest to translate)
+     - Mid-resource: middle 50% of languages
+     - Low-resource: bottom 25% of languages (hardest to translate)
+     """, container=True)
+
  demo.launch()
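The methodology's "fixed set of representative languages sampled by number of speakers" step can be sketched as weighted sampling without replacement. This is a minimal illustration only: the speaker counts, function name, and seed below are made up and do not come from app.py; the real figures would be drawn from Ethnologue and Wikidata as stated above.

```python
import random

# Illustrative speaker counts in millions -- NOT real project data.
SPEAKERS = {"en": 1452, "zh": 1118, "hi": 602, "es": 548, "ar": 274}

def sample_target_languages(speakers, k, seed=0):
    """Draw k distinct target languages, with probability proportional
    to speaker count (weighted sampling without replacement)."""
    rng = random.Random(seed)
    pool = dict(speakers)
    chosen = []
    for _ in range(k):
        langs = list(pool)
        weights = [pool[lang] for lang in langs]
        pick = rng.choices(langs, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement: each language picked once
    return chosen
```

Fixing the seed keeps the "fixed set" property: every evaluation run targets the same languages.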
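The BLEU evaluation described above compares a model translation against a human reference. A production pipeline would typically use a library such as sacrebleu; purely as a sketch of what the metric computes, here is a simplified self-contained sentence-level BLEU with whitespace tokenization and a brevity penalty (all names are illustrative, not from app.py):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 1]: geometric mean of
    1..max_n n-gram precisions, times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0 and a translation sharing no words with the reference scores 0.0, which is why higher is better on the leaderboard.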
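The three-tier split in "Language Categories" is a quartile cut over per-language BLEU scores: top 25% high-resource, middle 50% mid-resource, bottom 25% low-resource. A minimal sketch of that bucketing, with made-up scores and names not taken from app.py:

```python
from statistics import quantiles

def tier_languages(scores):
    """Map each language to a tier by where its BLEU score falls
    relative to the first and third quartiles of all scores."""
    q1, _, q3 = quantiles(scores.values(), n=4)  # quartile cut points
    tiers = {}
    for lang, score in scores.items():
        if score >= q3:
            tiers[lang] = "high-resource"   # top 25%
        elif score >= q1:
            tiers[lang] = "mid-resource"    # middle 50%
        else:
            tiers[lang] = "low-resource"    # bottom 25%
    return tiers

# Illustrative scores only -- NOT real benchmark results.
example = tier_languages({"en": 0.9, "fr": 0.8, "sw": 0.3, "yo": 0.1})
```

Note the tiers are relative to the evaluated pool, so a language's tier can shift as models improve or the language set changes.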