Samples can be extracted from the `clean_jsonl_3` folders using `shuf`. The sample files need to be prefixed with the doc_type and suffixed with the language code.

For example:

```bash
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nn.jsonl | shuf -n 1000000 > hplt_nno.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nb.jsonl | shuf -n 1000000 > hplt_nob.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*da.jsonl | shuf -n 1000000 > hplt_da.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*sv.jsonl | shuf -n 1000000 > hplt_sv.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*is.jsonl | shuf -n 1000000 > hplt_is.jsonl
```
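
The commands above differ only in the language code and the output suffix, so they could be folded into a loop. The following is a minimal sketch under that assumption; `sample_hplt` and its arguments are hypothetical names introduced here for illustration, not part of any existing tooling.

```bash
# Hypothetical helper: draw a random sample of N lines per language.
# Arguments: SRC_DIR (a clean_jsonl_3 folder), N (lines per language),
#            OUT_DIR (optional, defaults to the current directory).
sample_hplt() {
  src_dir=$1
  n=$2
  out_dir=${3:-.}
  # Each pair is <language code in the input file name>:<suffix of the output file>,
  # following the file names used in the commands above.
  for pair in nn:nno nb:nob da:da sv:sv is:is; do
    code=${pair%%:*}
    suffix=${pair##*:}
    cat "$src_dir"/external-hplt*"$code".jsonl | shuf -n "$n" > "$out_dir/hplt_$suffix.jsonl"
  done
}

# Example call (not executed here):
# sample_hplt /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3 1000000
```

Keeping the code-to-suffix mapping in one place makes it easier to add a language without copying the whole command.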

Or, for the restricted books (the documents are longer, so 100,000 samples should be enough):

```bash
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/restricted_books/restricted_books.*.jsonl | shuf -n 100000 > restricted_books_no.jsonl
```

Monitor memory usage while `shuf` is running.
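
One way to keep an eye on memory is to poll the resident set size of the running `shuf` process. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is available; `watch_rss` is a hypothetical helper name introduced here, and the polling interval is an arbitrary choice.

```bash
# Hypothetical helper: print the resident memory (VmRSS, in KiB) of a process
# at a fixed interval until the process has exited.
# Arguments: PID, INTERVAL in seconds (optional, defaults to 5).
# Assumes Linux: the value is read from /proc/<pid>/status.
watch_rss() {
  pid=$1
  interval=${2:-5}
  # The loop ends when the status file is gone or no longer reports VmRSS.
  while rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null) && [ -n "$rss" ]; do
    echo "pid=$pid rss_kib=$rss"
    sleep "$interval"
  done
}

# Example use (not executed here): run the pipeline in the background, then
# watch it. For a background pipeline, $! is the PID of its last command (shuf).
# cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nn.jsonl | shuf -n 1000000 > hplt_nno.jsonl &
# watch_rss $!
```

If the reported value approaches the machine's available memory, a smaller `-n` or sampling fewer input files at a time may be worth considering.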