
Model Summary

IBM has introduced GneissWeb, a large dataset of around 10 trillion tokens that addresses the data quality and quantity requirements of training LLMs. Models trained on GneissWeb outperform those trained on FineWeb 1.1.0 by 2.14 percentage points in average score across a set of 11 commonly used benchmarks.

To make GneissWeb reproducible, we provide a Bloom filter that represents the ids of all FineWeb 1.1.0 documents that are part of GneissWeb. The filter is 28 GB in size and belongs to the rbloom family of Bloom filters. It is meant to be probed with the id column of FineWeb 1.1.0 or of Common Crawl.

     Developers: IBM Research

     Release Date: Feb 10th, 2025

     License: Apache 2.0.

Usage

Intended Use

This filter offers a way to determine which documents of FineWeb 1.1.0 or Common Crawl are part of GneissWeb. Bloom Annotator transforms are available in IBM's Data Prep Kit to make this filter easy to use.

The Bloom Annotator transform assigns a label of 1 if the document is present in the GneissWeb Bloom filter; otherwise, it assigns 0. This approach provides a clear understanding of which documents in FineWeb 1.1.0 are also present in GneissWeb and which are not.

The id column in FineWeb 1.1.0 looks like this: urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7
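The labelling rule described above can be sketched in a few lines. This is a minimal illustration, not the Data Prep Kit transform itself: the `annotate` helper is hypothetical, and a plain Python set stands in for the 28 GB filter so the snippet runs on its own. With the real filter, the stand-in would presumably be replaced by the object returned from rbloom's `Bloom.load`, which must be given the same hash function that was used to build the filter; this card does not state which one that is.

```python
from typing import Container, Iterable, List


def annotate(ids: Iterable[str], members: Container[str]) -> List[int]:
    """Return 1 for each id found in `members`, otherwise 0."""
    return [1 if doc_id in members else 0 for doc_id in ids]


# Stand-in for the real filter (assumption): a set holding one known id.
# Any object that supports the `in` operator can be passed instead.
stand_in = {"urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7"}

sample_ids = [
    "urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7",  # id format from this card
    "urn:uuid:00000000-0000-0000-0000-000000000000",  # made-up id, for illustration
]

labels = annotate(sample_ids, stand_in)
print(labels)  # [1, 0]
```

Because the helper only relies on the `in` operator, the same code works unchanged whether `members` is a set, a loaded Bloom filter, or any other membership structure.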

Testing

The Bloom filter was tested with:

   Positive Examples : ~10M uuids from 192 parquet files in GneissWeb. These span all 96 snapshots.

   Negative Examples : 10,000 uuids in CC-MAIN-2024-51 (not present in FineWeb 1.1.0 and also not in GneissWeb).

The Bloom filter returned the correct answer for every example.
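A check of this kind can be sketched as follows. The sketch uses a tiny pure-Python Bloom filter (`ToyBloom`, a hypothetical class written for this example, not the rbloom implementation) and synthetic ids rather than the real GneissWeb or Common Crawl data. It counts positives the filter misses and negatives it wrongly accepts. A Bloom filter never misses an item that was added, whereas a small number of wrongly accepted negatives is possible in general.

```python
import hashlib
import uuid


class ToyBloom:
    """Tiny Bloom filter for illustration only; not the rbloom implementation."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive one bit position per hash by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def check(bloom, positives, negatives):
    """Count missed positives and wrongly accepted negatives."""
    false_negatives = sum(1 for p in positives if p not in bloom)
    false_positives = sum(1 for n in negatives if n in bloom)
    return false_negatives, false_positives


# Synthetic ids in the urn:uuid: format (assumption: stand-ins for real data).
positives = [f"urn:uuid:{uuid.UUID(int=i)}" for i in range(1000)]
negatives = [f"urn:uuid:{uuid.UUID(int=i)}" for i in range(10_000, 11_000)]

bloom = ToyBloom(num_bits=1 << 20, num_hashes=7)
for doc_id in positives:
    bloom.add(doc_id)

false_negatives, false_positives = check(bloom, positives, negatives)
print(false_negatives, false_positives)
```

The same `check` function could be pointed at the real filter and real id lists, since it only relies on the `in` operator.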
