
Samples can be extracted with shuf from the clean_jsonl_3 folders. They need to be prefixed by the doc_type and suffixed by the language code.

For example:

cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nn.jsonl | shuf -n 1000000 > hplt_nno.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nb.jsonl | shuf -n 1000000 > hplt_nob.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*da.jsonl | shuf -n 1000000 > hplt_da.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*sv.jsonl | shuf -n 1000000 > hplt_sv.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*is.jsonl | shuf -n 1000000 > hplt_is.jsonl

Or for the restricted books (each document is longer, so 100,000 samples should be enough):

cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/restricted_books/restricted_books.*.jsonl | shuf -n 100000 > restricted_books_no.jsonl
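
The HPLT commands above differ only in the language code, so they could be wrapped in a small loop. The following is a minimal sketch, not part of the original workflow: the function names sample_jsonl and sample_hplt are hypothetical, and the file-name-to-output-name mapping (nn to nno, nb to nob) is simply copied from the examples above.

```shell
#!/bin/sh
# Sketch only: sample_jsonl and sample_hplt are hypothetical helper names.

# Sample n random lines from every file matching
# <src_dir>/<doc_type>*<lang>.jsonl and write them to <out_file>.
sample_jsonl() {
    src_dir=$1
    doc_type=$2
    lang=$3
    n=$4
    out_file=$5
    cat "$src_dir"/"$doc_type"*"$lang".jsonl | shuf -n "$n" > "$out_file"
}

# Repeat the HPLT sampling for each language used in the examples.
# Each entry is <code in the source file name>:<code in the output file name>.
sample_hplt() {
    hplt_dir=$1
    hplt_n=$2
    for pair in nn:nno nb:nob da:da sv:sv is:is; do
        sample_jsonl "$hplt_dir" external-hplt "${pair%%:*}" "$hplt_n" "hplt_${pair##*:}.jsonl"
    done
}
```

With these definitions, `sample_hplt /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3 1000000` would write the same five output files as the individual commands, into the current directory.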

Monitor memory usage while shuf is running.
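
Besides watching memory by hand, one option is to cap the memory available to the sampling pipeline, so that a run which needs more fails instead of exhausting the machine. This is a sketch under assumptions: the function name sample_with_cap is hypothetical, and it relies on the shell's `ulimit -v` builtin (virtual memory limit in kB), which is available in bash and dash but is not guaranteed in every shell.

```shell
#!/bin/sh
# Sketch only: sample_with_cap is a hypothetical helper name.

# Usage: sample_with_cap <max_kb> <n> <out_file> <input files...>
# Runs the cat | shuf pipeline in a subshell whose virtual memory is
# limited to max_kb kilobytes; the limit does not affect the calling shell.
sample_with_cap() {
    max_kb=$1
    n=$2
    out_file=$3
    shift 3
    (
        ulimit -v "$max_kb" || exit 1
        cat "$@" | shuf -n "$n" > "$out_file"
    )
}
```

If shuf cannot allocate memory under the cap, it exits with an error and the function returns a non-zero status, which a calling script can check.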