Maurice Weber
mauriceweber
mauriceweber's activity
Add paper citation
1
#30 opened 3 months ago by davanstrien

RPV2 ccnet preprocessing
1
#29 opened 6 months ago by bpwl0121

sample split details
3
#4 opened over 1 year ago by sujantkumarkv

How can I download the sample-10B fastestly?
1
#28 opened 8 months ago by zgxiao

defunct book subset
4
#28 opened over 1 year ago by polinaeterna

How much disk space would the whole HF dataset take?
1
#27 opened 11 months ago by protossw512

rpv2-subsamples
1
#26 opened about 1 year ago by mauriceweber

The doc_id in duplicates is should contain?
3
#24 opened about 1 year ago by newbietuan

Deduplication steps
23
#15 opened over 1 year ago by ilyayudkovich

Here's a download script parallelized using Spark
1
#22 opened about 1 year ago by srowen

what is the meaning of snapshots in redpajama-data-v2?
2
#21 opened about 1 year ago by choidonghun

How to join documents and quality signals when downloading directly
3
#19 opened about 1 year ago by tgshdyfuhuf

Missing duplicates parquet files
5
#18 opened about 1 year ago by bebensee

Script to download all files of 1B sample data locally
2
#13 opened over 1 year ago by ivanzhouyq

What is the total size, of the entirety of this dataset in TB?
1
#10 opened over 1 year ago by Bayaz

What's the concept on partitions
2
#5 opened over 1 year ago by SwatCat

quality_signals, minhash and duplicates missing
2
#3 opened over 1 year ago by sheshanshag

Request to add retries into RedPajama-Data-V2.py script
1
#16 opened over 1 year ago by yura38

How to obtain duplicates from minhash?
1
#8 opened over 1 year ago by cq

Obtaining Filtered Samples
4
#12 opened over 1 year ago by ssingh22