CompVis Community

university

Activity Feed Request to join this org

AI & ML interests

None defined yet.

Recent Activity

Jhayase authored a paper 5 days ago

DataComp: In search of the next generation of multimodal datasets

Jhayase authored a paper 5 days ago

Scalable Extraction of Training Data from (Production) Language Models

Jhayase authored a paper 5 days ago

Query-Based Adversarial Prompt Generation

View all activity

compvis-community's activity

RafailFridman

authored a paper about 2 months ago

DynVFX: Augmenting Real Videos with Dynamic Content

Paper • 2502.03621 • Published Feb 5 • 29

anton-l

authored a paper about 2 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 213

jindi

authored a paper 2 months ago

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Paper • 2501.10799 • Published Jan 18 • 15

yossig

authored a paper 2 months ago

An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9 • 40

anton-l

posted an update 3 months ago

Post

2643

Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observe notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2