|
|
|
# Install dependencies
|
```bash
pip install -r requirement.txt
```
|
|
|
# Download the dataset
|
```bash
export WORKDIR_ROOT=<a directory which will hold all working files>

```
|
The downloaded data will be at `$WORKDIR_ROOT/ML50`.
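As a quick sanity check (assuming the download completed without errors), you can list the target directory:

```bash
# List the downloaded ML50 data to confirm it is in place.
ls $WORKDIR_ROOT/ML50
```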
|
|
|
# Preprocess the data
|
Install SentencePiece (SPM) from [here](https://github.com/google/sentencepiece), then set the following environment variables:
|
```bash
export WORKDIR_ROOT=<a directory which will hold all working files>
export SPM_PATH=<a path pointing to the sentencepiece spm_encode.py>
```
|
After preprocessing, the data will be organized as follows:

* `$WORKDIR_ROOT/ML50/raw`: extracted raw data
* `$WORKDIR_ROOT/ML50/dedup`: deduplicated data
* `$WORKDIR_ROOT/ML50/clean`: deduplicated data with the valid and test sentences removed
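For reference, applying a trained SentencePiece model to one of the cleaned files with the standard `spm_encode` command-line tool looks like the sketch below; the model path, input file, and output location are placeholders, and the `spm_encode.py` script pointed to by `$SPM_PATH` would be invoked analogously.

```bash
# Illustrative sketch only: segment one cleaned file with a trained SentencePiece model.
# The model path and file names below are placeholders, not outputs of the steps above.
spm_encode \
  --model=<path to a trained SentencePiece model> \
  --output_format=piece \
  < $WORKDIR_ROOT/ML50/clean/train.en_XX \
  > train.en_XX.spm
```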
|
|
|
|
|
|