AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

bigscience's activity

giadap posted an update 2 days ago
We've all become experts at clicking "I agree" without a second thought. In my latest blog post, I explore why traditional consent models are increasingly problematic in the age of generative AI.

I found three fundamental challenges:
- Scope problem: how can you know what you're agreeing to when your data could be used by AI systems in ways you can't foresee?
- Temporality problem: once an AI system learns from your data, good luck trying to make it "unlearn" it.
- Autonomy trap: the data you share today could create systems that pigeonhole you tomorrow.

Individual users shouldn't bear all the responsibility while big tech holds all the cards. We need better approaches to level the playing field, from collective advocacy and stronger technological safeguards to establishing "data fiduciaries" with a legal duty to protect our digital interests.

Available here: https://huggingface.co/blog/giadap/beyond-consent
davanstrien posted an update 28 days ago
📊 Introducing "Hugging Face Dataset Spotlight" 📊

I'm excited to share the first episode of our AI-generated podcast series spotlighting interesting datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains.
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.
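
If you want to poke at these yourself, here's a minimal loading sketch (it assumes the 🤗 `datasets` library and that each repo exposes a default config with a `train` split; check the dataset cards for the exact configs and column names):

```python
# Minimal sketch: assumes the default config and a "train" split;
# see each dataset card for the exact configs and column names.
from datasets import load_dataset

big_math = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")
print(big_math)     # features and row count
print(big_math[0])  # inspect one verified problem
```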

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4
davanstrien posted an update 29 days ago
Quick POC: turn a Hugging Face dataset card into a short podcast introducing the dataset, using only open models.

I think I'm the only weirdo who would enjoy listening to something like this though 😅

Here is an example for eth-nlped/stepverify
stas posted an update about 1 month ago
Do you want ArcticTraining at @SnowflakeDB to add the ability to post-train DeepSeek V3/R1 models with DPO using just a few GPU nodes?

Please vote here and tell others about it: https://github.com/snowflakedb/ArcticTraining/discussions/58

ArcticTraining is an open-source, easy-to-use post-training framework for NVIDIA GPUs, built on top of DeepSpeed.
davanstrien posted an update about 1 month ago
Hacked together a way to log trl GRPO training completions to a 🤗 dataset repo. This allows you to:

- Track rewards from multiple reward functions
- Treat the completions and rewards from training as a "proper" dataset and do EDA
- Share results for open science

The implementation is super hacky, but I'm curious if people would find this useful.

To push completions to the Hub, you just need two extra parameters:

log_completions=True
log_completions_hub_repo='your-username/repo-name'

Example dataset: davanstrien/test-logs
Colab: https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g
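
Here's a minimal sketch of how those two parameters would slot into a GRPO run. It assumes the patched TRL from the Colab above; `log_completions` is a standard GRPOConfig flag, while `log_completions_hub_repo` is the extra parameter from this hack, and the model id, dataset, and reward function below are just placeholders:

```python
# Sketch only: assumes the author's patched TRL from the Colab above.
# `log_completions` exists in upstream GRPOConfig; `log_completions_hub_repo`
# is the hacky extra parameter described in this post.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def brevity_reward(completions, **kwargs):
    """Toy reward function: shorter completions score higher."""
    return [-len(c) / 100 for c in completions]

train_dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

args = GRPOConfig(
    output_dir="grpo-completion-logging",
    log_completions=True,                                # record sampled completions + rewards
    log_completions_hub_repo="your-username/repo-name",  # dataset repo that receives the logs
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=brevity_reward,  # one or more reward functions
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```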

davanstrien posted an update about 1 month ago