Pretraining deduplication of data to prevent data leakage?

#55 opened by SS12444

Hi authors, I'm wondering if you do any kind of filtering during the pretraining stage, since OBELICS and the other pretraining datasets are large and may contain the same images that appear in the benchmarks. Will there be a release of the compiled pretraining dataset, like The Cauldron? Thank you

Yes, the pretraining datasets are already released:
OBELICS: https://huggingface.co/datasets/HuggingFaceM4/OBELICS
LAION COCO: https://huggingface.co/datasets/laion/laion-coco
Conceptual Captions, WIT, etc. are available from their official websites
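
For reference, here is a minimal sketch (not the authors' exact setup) of loading the two Hub-hosted datasets above with the `datasets` library; streaming and the `train` split are assumptions made because both datasets are large:

```python
from datasets import load_dataset

# Stream the datasets rather than downloading them in full (assumed setup).
obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)
laion_coco = load_dataset("laion/laion-coco", split="train", streaming=True)

# Peek at one OBELICS document (interleaved image URLs and text).
print(next(iter(obelics)))
```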
The proportions of the data mixture are given in the appendix of the paper.
We have only deduplicated the benchmark images from the SFT dataset, not from the pretraining data. I think the risk is very low: for example, MMMU and MathVista were released last year, whereas OBELICS was built from Common Crawl dumps that predate them.
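
As a rough illustration of what such an SFT-vs-benchmark image check could look like (not the authors' actual pipeline), here is a sketch using perceptual hashes; the `imagehash` library, the directory layout, and the distance threshold are all assumptions:

```python
from pathlib import Path

import imagehash
from PIL import Image


def build_benchmark_hashes(benchmark_dir: str) -> set:
    """Hash every benchmark image once so SFT images can be checked against them."""
    hashes = set()
    for path in Path(benchmark_dir).glob("**/*.jpg"):
        hashes.add(imagehash.phash(Image.open(path)))
    return hashes


def is_duplicate(image_path: str, benchmark_hashes: set, max_distance: int = 4) -> bool:
    """Flag an SFT image whose perceptual hash is within max_distance (Hamming) of any benchmark hash."""
    h = imagehash.phash(Image.open(image_path))
    return any(h - bench_h <= max_distance for bench_h in benchmark_hashes)
```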
