Samples can be extracted using `shuf` from the `clean_jsonl_3` folders. The output files need to be prefixed with the doc_type and suffixed with the language code. For example:

```bash
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nn.jsonl | shuf -n 1000000 > hplt_nno.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nb.jsonl | shuf -n 1000000 > hplt_nob.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*da.jsonl | shuf -n 1000000 > hplt_da.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*sv.jsonl | shuf -n 1000000 > hplt_sv.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*is.jsonl | shuf -n 1000000 > hplt_is.jsonl
```

Or for the restricted books (since each document is longer, 100,000 samples should be enough):

```bash
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/restricted_books/restricted_books.*.jsonl | shuf -n 100000 > restricted_books_no.jsonl
```

Monitor memory usage while running `shuf`, since it may hold a large amount of input in memory.
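One way to keep an eye on memory is to wrap the `shuf` step in GNU time (`/usr/bin/time -v`), which prints the maximum resident set size when the command finishes, or to watch system-wide memory from a second terminal. A minimal sketch, assuming GNU coreutils and GNU time are installed (the file paths below just reuse one of the examples above):

```bash
# Report peak memory ("Maximum resident set size") of the shuf step on completion.
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nb.jsonl \
  | /usr/bin/time -v shuf -n 1000000 > hplt_nob.jsonl

# Alternatively, in a separate terminal, refresh overall memory usage every 2 seconds.
watch -n 2 free -h
```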