Omar Choukrani

choukrani

AI & ML interests

None yet

Recent Activity

liked a dataset 7 days ago

atlasia/TerjamaBench

updated a Space 9 days ago

atlasia/Terjman-Large

updated a Space 9 days ago

atlasia/Open-Arabic-Dialect-Identification-Leaderboard

Articles

TerjamaBench: A Cultural Benchmark for English-Darija Machine Translation

17 days ago

• 25

choukrani's activity

liked a dataset 7 days ago

atlasia/TerjamaBench

Viewer • Updated 9 days ago • 850 • 246 • 13

updated 2 Spaces 9 days ago

Open Arabic Dialect Identification Leaderboard

updated a dataset 17 days ago

atlasia/TerjamaBench

upvoted an article 17 days ago

Article

TerjamaBench: A Cultural Benchmark for English-Darija Machine Translation

17 days ago

• 25

liked a Space 19 days ago

Ad-dabit Arabic Diacritizer

Accurate Arabic Text Diacritizer (Tashkeel)

reacted to Sentdex's post with 👍 23 days ago

Post

5944

Benchmarks!

I have lately been diving deep into the main benchmarks we all use to evaluate and compare models.

If you've never actually looked under the hood at how benchmarks work, check out the LM eval harness from EleutherAI: https://github.com/EleutherAI/lm-evaluation-harness

Also check out the benchmark datasets. You can find the ones for the LLM leaderboard on the About tab here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, then click a dataset and actually peek at the data that comprises these benchmarks.

It feels to me like benchmarks only represent a tiny portion of what we actually use and want LLMs for, and I doubt I'm alone in that sentiment.

Beyond this, the actual evaluations of model responses are extremely strict and often rely on rudimentary NLP techniques when, at this point, we have LLMs themselves that are more than capable of evaluating and scoring responses.
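To make the strictness point concrete, here is a minimal sketch of three string-based scoring rules of the kind being described. This is not the code of any particular harness or leaderboard; the function names (`exact_match`, `normalized_match`, `contains_match`) and the sample answers are hypothetical and chosen only for illustration.

```python
import re
import string


def exact_match(prediction: str, reference: str) -> bool:
    """Strictest rule: the strings must be identical, character for character."""
    return prediction == reference


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_match(prediction: str, reference: str) -> bool:
    """Looser rule: compare the strings after normalization."""
    return normalize(prediction) == normalize(reference)


def contains_match(prediction: str, reference: str) -> bool:
    """Loosest rule here: the normalized reference appears as whole words in the prediction."""
    return f" {normalize(reference)} " in f" {normalize(prediction)} "


if __name__ == "__main__":
    reference = "Paris"
    # Hypothetical model outputs that a human would all accept as correct.
    predictions = ["Paris", "paris.", "The capital of France is Paris."]
    for pred in predictions:
        print(
            repr(pred),
            exact_match(pred, reference),
            normalized_match(pred, reference),
            contains_match(pred, reference),
        )
```

All three sample outputs are right by any human reading, yet only the first passes the strictest rule and only the first two pass the normalized one. Even the loosest rule is still pure string matching, so a correct paraphrase that never uses the reference word would fail it.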

It feels like we've made great strides in the quality of LLMs themselves, but almost no progress in the quality of how we benchmark them.

If you have any ideas for how benchmarks could be a better assessment of an LLM, or know of good research papers that tackle this challenge, please share!