---
license: apache-2.0
---

# Perplexity tools

## 1. Create samples from `clean_json_3` sources

Each sample should contain between 1k and 1M documents. Read [samples/README.md](./samples/README.md) for details. Output file names must be prefixed by the `doc_type` and suffixed by the two-letter language code. For example:

```bash
$ cat /nfsmounts/datastore/ncc_corpus/mimir/jsonl_2/nrk/nrk-articles.jsonl | shuf -n 100000 > samples/restricted-newspapers_nrk_no.json
```
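
The naming convention above can be checked before scoring. The following is a minimal sketch, not part of this repository: the `parse_sample_name` helper and its regular expression are hypothetical, and they assume that everything before the final `_xx.json` counts as the `doc_type`.

```python
import re
from pathlib import Path

# Hypothetical pattern (an assumption, not taken from the repository scripts):
# <doc_type>_<two-letter language code>.json
SAMPLE_NAME = re.compile(r"^(?P<doc_type>.+)_(?P<lang>[a-z]{2})\.json$")


def parse_sample_name(path):
    """Return (doc_type, lang) for a sample file, or raise ValueError."""
    match = SAMPLE_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected sample file name: {path}")
    return match["doc_type"], match["lang"]


print(parse_sample_name("samples/restricted-newspapers_nrk_no.json"))
# ('restricted-newspapers_nrk', 'no')
```

A file name without a trailing two-letter code, such as `samples/foo.json`, raises `ValueError` in this sketch.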

## 2. Create the perplexity scores for each file

For example, to create scores only for the samples whose `doc_type` matches `restricted-newspapers_*`:

```bash
$ ls samples/restricted-newspapers_* | parallel --lb --jobs 5 python samples_scores.py {} --output_path scores/ --jobs 15
```

## 3. Create the quartiles CSV needed for segmenting and downsampling

The different `doc_type`s will be grouped together. By passing the flag `--group_by_prefix_lang`, the grouping will happen on the pair `doc_type` prefix and language code, e.g., `wikipedia_en`.
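
As an illustration of what grouping on the `doc_type` prefix and language code could look like, here is a minimal sketch. It is not the logic of `samples_quartiles.py`: the helper names and the file names are hypothetical, and it assumes that the prefix is the text before the first underscore and the language code is the text after the last one.

```python
from collections import defaultdict
from pathlib import Path


def prefix_lang_key(path):
    """Build a key such as 'wikipedia_en' from a file name (assumed layout)."""
    stem = Path(path).stem                    # e.g. "restricted-newspapers_nrk_no"
    doc_type, _, lang = stem.rpartition("_")  # split off the language code
    prefix = doc_type.split("_", 1)[0]        # assumption: prefix ends at the first "_"
    return f"{prefix}_{lang}"


def group_by_prefix_lang(paths):
    """Map each prefix/language key to the list of files that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[prefix_lang_key(path)].append(path)
    return dict(groups)


# Hypothetical file names, used only for illustration.
files = [
    "scores/wikipedia_en.json",
    "scores/restricted-newspapers_nrk_no.json",
    "scores/restricted-newspapers_vg_no.json",
]
print(group_by_prefix_lang(files))
```

Under these assumptions, the two `restricted-newspapers_*_no` files end up in the same group, `restricted-newspapers_no`.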

Different downsampling ratios can be specified by using the `--sampling_ratio_per_lang` flag. For `mimir-base`, the downsampling by language is defined as follows: `"da:0.23,en:0.21,sv:0.08,is:0.50"`.

```bash
$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.23,en:0.21,sv:0.08,is:0.50" --output_file csv/base-perplexity_quartiles_sampling.csv
```
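
The value passed to `--sampling_ratio_per_lang` is a comma-separated list of `lang:ratio` pairs. The sketch below shows one way such a string could be parsed into a dictionary. The `parse_sampling_ratios` function is hypothetical and may differ from what `samples_quartiles.py` actually does. The range check is likewise an assumption.

```python
def parse_sampling_ratios(spec):
    """Parse 'da:0.23,en:0.21' into {'da': 0.23, 'en': 0.21}."""
    ratios = {}
    for item in spec.split(","):
        lang, _, value = item.strip().partition(":")
        ratio = float(value)  # raises ValueError if the ratio is missing or malformed
        # Assumption: a ratio is a fraction of documents to keep, so it lies in [0, 1].
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"ratio for {lang!r} must be within [0, 1], got {ratio}")
        ratios[lang] = ratio
    return ratios


print(parse_sampling_ratios("da:0.23,en:0.21,sv:0.08,is:0.50"))
# {'da': 0.23, 'en': 0.21, 'sv': 0.08, 'is': 0.5}
```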

For `mimir-extended`, the downsampling by language is defined as follows: `"da:0.43,en:0.81,sv:0.15,code:0.62"`.

```bash
$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.43,en:0.81,sv:0.15,code:0.62" --output_file csv/extended-perplexity_quartiles_sampling.csv --overwrite_prefix_lang "starcoder_en:starcode_code"
```

More information is available in the [spreadsheet](https://docs.google.com/spreadsheets/d/108oGVVN-Ml-TDN59UXR96oeBBt2FbgT81zt8_1y9PUw/edit?usp=sharing).