Bias Leaderboard Development

community


Recent Activity

Bias-Leaderboard's activity

giadap
posted an update 3 days ago
We've all become experts at clicking "I agree" without a second thought. In my latest blog post, I explore why these traditional consent models are increasingly problematic in the age of generative AI.

I found three fundamental challenges:
- Scope problem: how can you know what you're agreeing to when AI could use your data in different ways?
- Temporality problem: once an AI system learns from your data, good luck trying to make it "unlearn" it.
- Autonomy trap: the data you share today could create systems that pigeonhole you tomorrow.

Individual users shouldn't bear all the responsibility, while big tech holds all the cards. We need better approaches to level the playing field, from collective advocacy and stronger technological safeguards to establishing "data fiduciaries" with a legal duty to protect our digital interests.

Available here: https://huggingface.co/blog/giadap/beyond-consent
clefourrier
posted an update 17 days ago
Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons will be apples to apples, but potentially suboptimal for a given model family (just as some users interact suboptimally with models).

It also contains a cool section (6) on training data memorization rate! It's important to see whether your model will output the training data it has seen verbatim: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.
giadap
posted an update about 2 months ago
From ancient medical ethics to modern AI challenges, the journey of consent represents one of humanity's most fascinating ethical evolutions. In my latest blog post, I explore how we've moved from medical paternalism to a new frontier where AI capabilities force us to rethink consent.

The "consent gap" in AI is real: while we can approve initial data use, AI systems can generate countless unforeseen applications of our personal information. It's like signing a blank check without knowing all possible amounts that could be filled in.

Should we reimagine consent for the AI age? Perhaps we need dynamic consent systems that evolve alongside AI capabilities, similar to how healthcare transformed from physician-centered authority to patient autonomy.

Curious to hear your thoughts: how can we balance technological innovation with meaningful user sovereignty over digital identity?

Read more: https://huggingface.co/blog/giadap/evolution-of-consent
meg
posted an update 2 months ago
💫...And we're live!💫 Seasonal newsletter from ethicsy folks at Hugging Face, exploring the ethics of "AI Agents"
https://huggingface.co/blog/ethics-soc-7
Our analyses found:
- There's a spectrum of "agent"-ness
- *Safety* is a key issue, leading to many other value-based concerns
Read for details & what to do next!
With @evijit, @giadap, and @sasha
yjernite
posted an update 2 months ago
🤗👤 💻 Speaking of AI agents ...
...Is easier with the right words ;)

My colleagues @meg @evijit @sasha and @giadap just published a wonderful blog post outlining some of the main relevant notions with their signature blend of value-informed and risk-benefits contrasting approach. Go have a read!

https://huggingface.co/blog/ethics-soc-7
yjernite
posted an update 4 months ago
🇪🇺 Policy Thoughts in the EU AI Act Implementation 🇪🇺

There is a lot to like in the first draft of the EU GPAI Code of Practice, especially as regards transparency requirements. The Systemic Risks part, on the other hand, is concerning for both smaller developers and for external stakeholders.

I wrote more on this topic ahead of the next draft. TLDR: more attention to immediate large-scale risks and to collaborative solutions supported by evidence can help everyone - as long as developers disclose sufficient information about their design choices and deployment contexts.

Full blog here, based on our submitted response with @frimelle and @brunatrevelin:

https://huggingface.co/blog/yjernite/eu-draft-cop-risks#on-the-proposed-taxonomy-of-systemic-risks
clefourrier
posted an update 11 months ago
In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini!
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm
clefourrier
posted an update 12 months ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only over new problems that are not in the training data. This means... contamination-free code evals! 🚀
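
To make the date-based selection concrete, here is a minimal Python sketch. It is not LiveCodeBench's actual code: the `results` records, the `score_after` helper, and the cutoff date are all hypothetical, chosen only to show how restricting the average to problems published after a model's assumed training cutoff changes the score.

```python
from datetime import date
from statistics import mean

# Hypothetical per-problem results: publication date and a score in [0, 1].
results = [
    {"published": date(2023, 6, 1), "score": 1.0},
    {"published": date(2023, 11, 15), "score": 1.0},
    {"published": date(2024, 2, 10), "score": 0.0},
    {"published": date(2024, 3, 5), "score": 1.0},
]


def score_after(results, cutoff):
    """Average score over problems published strictly after `cutoff`."""
    recent = [r["score"] for r in results if r["published"] > cutoff]
    if not recent:
        raise ValueError("no problems published after the cutoff")
    return mean(recent)


# Assumed training-data cutoff for the model being evaluated.
training_cutoff = date(2023, 12, 31)

overall = mean(r["score"] for r in results)    # all problems, old and new
fresh = score_after(results, training_cutoff)  # only problems the model cannot have seen

print(f"overall={overall:.2f} fresh={fresh:.2f}")
```

A gap between the two numbers is one possible sign that older problems leaked into training data, though with so few problems it could just as well be noise.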

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
clefourrier
posted an update 12 months ago
🆕 Evaluate your RL agents - who's best at Atari? 🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations 🚶 and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard