Samples can be extracted from the `clean_jsonl_3` folders using `shuf`. The sample files need to be prefixed by the doc_type and suffixed by the language code.

For example:

```bash
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nn.jsonl |shuf -n 1000000  > hplt_nno.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*nb.jsonl |shuf -n 1000000  > hplt_nob.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*da.jsonl |shuf -n 1000000  > hplt_da.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*sv.jsonl |shuf -n 1000000  > hplt_sv.jsonl
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/external-hplt*is.jsonl |shuf -n 1000000  > hplt_is.jsonl
```
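
The five commands above differ only in the language code, so they could be wrapped in a loop. The following is a minimal sketch, not part of the original workflow: the function name `sample_hplt_all` and its three optional arguments are hypothetical, and it assumes the source files follow the `external-hplt*<code>.jsonl` naming shown above.

```bash
# Hypothetical helper (an assumption, not an existing script).
# Usage: sample_hplt_all [SRC_DIR] [N] [OUT_DIR]
sample_hplt_all() {
  local src_dir="${1:-/nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3}"
  local n="${2:-1000000}"
  local out_dir="${3:-.}"
  local pair code suffix
  # Each pair maps the language code in the source file name
  # to the suffix used in the sample file name.
  for pair in nn:nno nb:nob da:da sv:sv is:is; do
    code="${pair%%:*}"
    suffix="${pair##*:}"
    cat "$src_dir"/external-hplt*"$code".jsonl | shuf -n "$n" > "$out_dir/hplt_${suffix}.jsonl"
  done
}
```

Called without arguments, `sample_hplt_all` should write the same five files as the commands above into the current directory.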

Or for the restricted books (these documents are longer, so 100,000 samples should be enough):

```bash
cat /nfsmounts/datastore/ncc_corpus/mimir/clean_jsonl_3/restricted_books/restricted_books.*.jsonl|shuf -n 100000 > restricted_books_no.jsonl
```

Monitor memory usage while doing `shuf`.
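
If memory does become a problem, one possible alternative is reservoir sampling, which keeps only the sampled lines in memory rather than the whole input. The sketch below is a swapped-in technique, not the method used above: `reservoir_sample` is a hypothetical function name, the optional second argument is an assumed seed parameter, and the output is a uniform random sample that is not itself shuffled.

```bash
# Hypothetical helper (an assumption, not part of the original workflow).
# Usage: cat files... | reservoir_sample N [SEED] > sample.jsonl
reservoir_sample() {
  local n="$1"
  local seed="${2:-$RANDOM}"
  awk -v n="$n" -v seed="$seed" '
    BEGIN { srand(seed) }
    # Fill the reservoir with the first n lines.
    NR <= n { r[NR] = $0; next }
    # Afterwards, line NR replaces a random slot with probability n/NR.
    {
      j = int(rand() * NR) + 1
      if (j <= n) r[j] = $0
    }
    END {
      m = (NR < n) ? NR : n
      for (i = 1; i <= m; i++) print r[i]
    }
  '
}
```

For example, `cat restricted_books.*.jsonl | reservoir_sample 100000 > restricted_books_no.jsonl` would hold at most 100,000 lines in memory at a time.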