## Download and preprocess Criteo TB dataset
[Apache Beam](https://beam.apache.org) enables distributed preprocessing of the
dataset and can be run on
[Google Cloud Dataflow](https://cloud.google.com/dataflow/). The preprocessing
scripts can also be run locally via Beam's DirectRunner, provided the local host
has enough CPU, memory, and storage.
Install the required packages:
```bash
python3 setup.py install
```
Set up the following environment variables, replacing `bucket-name` with the
name of your Cloud Storage bucket and `my-gcp-project` with your GCP project ID.
```bash
export STORAGE_BUCKET=gs://bucket-name
export PROJECT=my-gcp-project
export REGION=us-central1
```
Note: When running locally, the environment variables above are not needed and
a local path can be used instead of `gs://bucket-name`. Also consider passing a
smaller `--max_vocab_size` argument. A local-run sketch is shown below.
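For reference, a local run of any of the commands below only swaps the runner
and the paths. A minimal sketch of the sharding step (step 2), assuming the
scripts accept Beam's `DirectRunner` and using illustrative local paths:
```bash
# Local-run sketch: DirectRunner keeps everything on the local host,
# so --project/--region are not needed. Paths are illustrative.
python3 shard_rebalancer.py \
  --input_path "/data/criteo_raw/train/*" \
  --output_path "/data/criteo_raw_sharded/train/train" \
  --num_output_files 64 --filetype csv --runner DirectRunner
```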
1. Download the raw
[Criteo TB dataset](https://labs.criteo.com/2013/12/download-terabyte-click-logs/)
to a GCS bucket.
Organize the data in the following way:
* The files `day_0.gz`, `day_1.gz`, ..., `day_22.gz` in
`${STORAGE_BUCKET}/criteo_raw/train/`
* The file `day_23.gz` in `${STORAGE_BUCKET}/criteo_raw/test/`
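For example, with the raw `day_*.gz` files already downloaded to the local
machine, the layout above could be produced with `gsutil` (local file names are
assumed to match the originals):
```bash
# Copy the 23 training days and the single test day into the expected layout.
gsutil -m cp day_{0..22}.gz "${STORAGE_BUCKET}/criteo_raw/train/"
gsutil cp day_23.gz "${STORAGE_BUCKET}/criteo_raw/test/"
```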
2. Shard the raw training/test data into multiple files.
```bash
python3 shard_rebalancer.py \
--input_path "${STORAGE_BUCKET}/criteo_raw/train/*" \
--output_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/train" \
--num_output_files 1024 --filetype csv --runner DataflowRunner \
--project ${PROJECT} --region ${REGION}
```
```bash
python3 shard_rebalancer.py \
--input_path "${STORAGE_BUCKET}/criteo_raw/test/*" \
--output_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/test" \
--num_output_files 64 --filetype csv --runner DataflowRunner \
--project ${PROJECT} --region ${REGION}
```
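As a quick sanity check, the number of output shards should match
`--num_output_files` (1024 for train, 64 for test):
```bash
# Count the sharded output files.
gsutil ls "${STORAGE_BUCKET}/criteo_raw_sharded/train/*" | wc -l   # expect 1024
gsutil ls "${STORAGE_BUCKET}/criteo_raw_sharded/test/*" | wc -l    # expect 64
```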
3. Generate vocabulary and preprocess the data.
Generate vocabulary:
```bash
python3 criteo_preprocess.py \
--input_path "${STORAGE_BUCKET}/criteo_raw_sharded/*/*" \
--output_path "${STORAGE_BUCKET}/criteo/" \
--temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
--vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000000 \
--project ${PROJECT} --region ${REGION}
```
A vocabulary for each feature is written to the
`${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_??_vocab` files. The
vocabulary size of a feature is the line count of its file
(`wc -l <feature_vocab_file>`).
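Since the vocabulary files live in GCS, the line count can be taken by piping
through `gsutil cat`; the file name below is illustrative, so pick one from the
listing:
```bash
# List the generated vocabulary files, then count entries in one of them.
gsutil ls "${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/"
gsutil cat "${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_10_vocab" | wc -l
```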
Preprocess training and test data:
```bash
python3 criteo_preprocess.py \
--input_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/*" \
--output_path "${STORAGE_BUCKET}/criteo/train/train" \
--temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
--runner DataflowRunner --max_vocab_size 5000000 \
--project ${PROJECT} --region ${REGION}
```
```bash
python3 criteo_preprocess.py \
--input_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/*" \
--output_path "${STORAGE_BUCKET}/criteo/test/test" \
--temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
--runner DataflowRunner --max_vocab_size 5000000 \
--project ${PROJECT} --region ${REGION}
```
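The preprocessed output is CSV (the optional re-balancing step below reads it
with `--filetype csv`), so a record can be spot-checked directly; the shard
name below assumes Beam's default `-00000-of-NNNNN` suffix:
```bash
# Print the first record of the first preprocessed training shard.
gsutil cat "${STORAGE_BUCKET}/criteo/train/train-00000-of-*" | head -n 1
```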
4. (Optional) Re-balance the dataset.
```bash
python3 shard_rebalancer.py \
--input_path "${STORAGE_BUCKET}/criteo/train/*" \
--output_path "${STORAGE_BUCKET}/criteo_balanced/train/train" \
--num_output_files 8192 --filetype csv --runner DataflowRunner \
--project ${PROJECT} --region ${REGION}
```
```bash
python3 shard_rebalancer.py \
--input_path "${STORAGE_BUCKET}/criteo/test/*" \
--output_path "${STORAGE_BUCKET}/criteo_balanced/test/test" \
--num_output_files 1024 --filetype csv --runner DataflowRunner \
--project ${PROJECT} --region ${REGION}
```
At this point the training and test data are under:
* `${STORAGE_BUCKET}/criteo_balanced/train/`
* `${STORAGE_BUCKET}/criteo_balanced/test/`
All other intermediate data under the bucket can be removed.
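For example, the intermediate copies could be cleaned up with `gsutil` once the
balanced data is verified (destructive; double-check the paths first):
```bash
# Remove intermediate copies, keeping criteo_balanced/ (and, if still needed,
# the raw download and the vocabulary files).
gsutil -m rm -r \
  "${STORAGE_BUCKET}/criteo_raw_sharded" \
  "${STORAGE_BUCKET}/criteo"
```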