David Pomerenke
committed on
Commit · 0a5d23d
1 Parent(s): a65282b
Metadata and Methodology
README.md CHANGED
@@ -4,8 +4,15 @@ emoji: 🌍
 colorFrom: purple
 colorTo: pink
 sdk: gradio
-license:
+license: cc-by-sa-4.0
 short_description: Evaluating LLM performance across all human languages.
+datasets:
+- openlanguagedata/flores_plus
+models:
+- meta-llama/Llama-3.3-70B-Instruct
+- mistralai/Mistral-Small-24B-Instruct-2501
+- deepseek-ai/DeepSeek-V3
+- microsoft/phi-4
 tags:
 - leaderboard
 - submission:manual
@@ -23,6 +30,7 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 For tag meaning, see https://huggingface.co/spaces/leaderboards/LeaderboardsExplorer
 -->
 
+[](https://huggingface.co/spaces/datenlabor-bmz/ai-language-monitor)
 
 # AI Language Monitor 🌍
 
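Not part of the commit, but for context: the new `license`, `datasets`, and `models` fields are standard Space card metadata and can be read back through the Hub. A minimal sketch, assuming the Space is public and `huggingface_hub` is installed:

```python
# Sketch (not from this commit): read the Space's README front matter back
# via the Hub to confirm the new metadata fields landed as expected.
from huggingface_hub import RepoCard

card = RepoCard.load("datenlabor-bmz/ai-language-monitor", repo_type="space")
meta = card.data.to_dict()
print(meta["license"])   # cc-by-sa-4.0
print(meta["datasets"])  # ['openlanguagedata/flores_plus']
print(meta["models"])    # ['meta-llama/Llama-3.3-70B-Instruct', ...]
```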
app.py CHANGED
@@ -190,4 +190,24 @@ with gr.Blocks(title="AI Language Translation Benchmark") as demo:
     gr.DataFrame(value=df, label="Language Results", show_search="search")
     gr.Plot(value=scatter_plot, label="Language Coverage")
 
+
+    gr.Markdown("""
+    ## Methodology
+    ### Dataset
+    - Using the [FLORES+](https://huggingface.co/datasets/openlanguagedata/flores_plus) evaluation set, a high-quality human-translated benchmark covering 200 languages
+    - Each language is tested with the same 100 sentences
+    - All translations are from the evaluated language into a fixed set of representative languages, sampled by number of speakers
+    - Language statistics are sourced from Ethnologue and Wikidata
+
+    ### Models & Evaluation
+    - Models are accessed through [OpenRouter](https://openrouter.ai/), including fast models from all major labs, open and closed
+    - **BLEU Score**: Translations are evaluated using the BLEU metric, which measures how similar the model's translation is to a human reference translation; higher is better
+
+    ### Language Categories
+    Languages are divided into three tiers based on translation difficulty:
+    - High-Resource: Top 25% of languages by BLEU score (easiest to translate)
+    - Mid-Resource: Middle 50% of languages
+    - Low-Resource: Bottom 25% of languages (hardest to translate)
+    """, container=True)
+
 demo.launch()
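The methodology block added above describes the pipeline in prose; the sketches below illustrate each step in Python. First, loading the FLORES+ sentences. The per-language config name (`eng_Latn`), the `devtest` split, and the `text` column are assumptions about the dataset layout, not code from this commit; FLORES+ is gated, so an authenticated Hub token is required.

```python
# Hypothetical sketch of fetching the 100 evaluation sentences for one
# language; config name, split, and column name are assumed, not confirmed.
from datasets import load_dataset

flores = load_dataset("openlanguagedata/flores_plus", "eng_Latn", split="devtest")
sentences = flores["text"][:100]  # the same 100 sentences for every language
print(len(sentences), sentences[0])
```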
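OpenRouter exposes an OpenAI-compatible API, so a translation request can go through the standard `openai` client; the prompt wording and the lower-case model slug here are illustrative assumptions rather than the app's actual prompt:

```python
# Sketch of one translation request via OpenRouter's OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",  # OpenRouter slug (assumed)
    messages=[{
        "role": "user",
        "content": "Translate into German; reply with the translation only:\n\nHello, world!",
    }],
)
print(response.choices[0].message.content)
```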
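Scoring is corpus-level BLEU, which the standard `sacrebleu` package computes directly; the sentences here are made up for illustration:

```python
# Corpus BLEU with sacrebleu: model translations vs. human references.
# references has shape [num_reference_sets][num_sentences]; one set here.
import sacrebleu

hypotheses = ["Hallo Welt!", "Wie geht's dir?"]      # model output
references = [["Hallo, Welt!", "Wie geht es dir?"]]  # human references
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100, higher is better
```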
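Finally, the three resource tiers are quartile buckets over per-language BLEU, as described in the Language Categories section above. A sketch with pandas; the DataFrame layout and the sample scores are assumed for illustration:

```python
# Quartile-based tiers: top 25% of languages by BLEU -> High-Resource,
# middle 50% -> Mid-Resource, bottom 25% -> Low-Resource.
import pandas as pd

df = pd.DataFrame({
    "language": ["deu", "swh", "quy", "fra"],  # hypothetical results
    "bleu": [42.0, 18.5, 4.2, 45.1],
})
q25, q75 = df["bleu"].quantile([0.25, 0.75])
df["tier"] = pd.cut(
    df["bleu"],
    bins=[float("-inf"), q25, q75, float("inf")],
    labels=["Low-Resource", "Mid-Resource", "High-Resource"],
)
print(df.sort_values("bleu", ascending=False))
```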