Samples can be extracted with shuf from the clean_jsonl_3 folders. The output files need to be prefixed with the doc_type and suffixed with the language code.
For example:
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nn.jsonl | shuf -n 1000000 > hplt_nno.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nb.jsonl | shuf -n 1000000 > hplt_nob.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*da.jsonl | shuf -n 1000000 > hplt_da.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*sv.jsonl | shuf -n 1000000 > hplt_sv.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*is.jsonl | shuf -n 1000000 > hplt_is.jsonl
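The five commands above differ only in the source language code and the output suffix (note the nn → nno and nb → nob renames), so they can be collapsed into a loop. A minimal sketch, demonstrated on tiny synthetic shards in a temp directory rather than the real clean_jsonl_3 paths:

```shell
#!/bin/sh
# Sketch of the per-language sampling loop, using fake shards so it
# runs anywhere; substitute BASE with the real clean_jsonl_3 path.
set -e
BASE=$(mktemp -d)
# Fake shards: 10 JSON lines each for two source language codes.
for lang in nn nb; do
    for i in $(seq 10); do
        echo "{\"id\": $i, \"lang\": \"$lang\"}"
    done > "$BASE/external-hplt-part1-$lang.jsonl"
done
# Map source codes to the output codes used in the filenames above
# (nn -> nno, nb -> nob); real runs would use shuf -n 1000000.
for pair in nn:nno nb:nob; do
    src=${pair%%:*}    # language code in the source filenames
    out=${pair##*:}    # language code used in the output filename
    cat "$BASE"/external-hplt*"$src".jsonl | shuf -n 5 > "hplt_$out.jsonl"
done
wc -l hplt_nno.jsonl hplt_nob.jsonl
```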
Or for the restricted books (as the documents are longer, 100,000 should be enough):
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/restricted_books/restricted_books.*.jsonl|shuf -n 100000 > restricted_books_no.jsonl
Monitor memory usage while running shuf.
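Recent GNU shuf with -n uses reservoir sampling, so it only keeps the n sampled lines in memory rather than the whole input, but a million JSONL documents can still be sizeable. One way to watch memory is to run the sampling job in the background and poll its RSS with ps. A minimal sketch, using a small synthetic input in place of the real cat | shuf pipeline:

```shell
#!/bin/sh
# Run a sampling job in the background and poll its memory usage.
# The seq input is a stand-in for the real cat | shuf pipeline.
seq 100000 | shuf -n 1000 > sample.txt &
pid=$!   # PID of shuf (last command in the background pipeline)
while kill -0 "$pid" 2>/dev/null; do
    # Resident set size of the shuf process, in kilobytes.
    # A fast job may finish before the first poll; that is fine.
    ps -o rss= -p "$pid" 2>/dev/null
    sleep 1
done
wait "$pid"
wc -l sample.txt
```

For a whole-system view, watch free -h or top in a second terminal instead of polling a single PID.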