## Download and preprocess Criteo TB dataset

[Apache Beam](https://beam.apache.org) enables distributed preprocessing of the dataset and can be run on [Google Cloud Dataflow](https://cloud.google.com/dataflow/). The preprocessing scripts can also be run locally via the DirectRunner, provided that the local host has enough CPU, memory, and storage.

Install the required packages:

```bash
python3 setup.py install
```

Set up the following environment variables, replacing `bucket-name` with the name of your Cloud Storage bucket and `my-gcp-project` with your GCP project name:

```bash
export STORAGE_BUCKET=gs://bucket-name
export PROJECT=my-gcp-project
export REGION=us-central1
```

Note: If running locally, the environment variables above are not needed; use a local path instead of `gs://bucket-name`, and consider passing a smaller `max_vocab_size` argument.

1. Download the raw [Criteo TB dataset](https://labs.criteo.com/2013/12/download-terabyte-click-logs/) to a GCS bucket. Organize the data as follows:

   * The files `day_0.gz`, `day_1.gz`, ..., `day_22.gz` in `${STORAGE_BUCKET}/criteo_raw/train/`
   * The file `day_23.gz` in `${STORAGE_BUCKET}/criteo_raw/test/`

2. Shard the raw training/test data into multiple files.

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/train" \
  --num_output_files 1024 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/test" \
  --num_output_files 64 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

3. Generate the vocabulary and preprocess the data.

Generate the vocabulary:

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/*/*" \
  --output_path "${STORAGE_BUCKET}/criteo/" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```

The vocabulary for each feature is written to `${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_??_vocab`. The vocabulary size of a feature is the number of lines in its file (`wc -l`); a quick check is sketched after this step.

Preprocess the training and test data:

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo/train/train" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo/test/test" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```
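As a quick sanity check of the generated vocabularies, you can count the entries in each vocabulary file. This is a minimal sketch, assuming the files follow the `feature_??_vocab` naming above and that `gsutil` is installed and authenticated; the specific feature number used is only an example:

```bash
# List the generated vocabulary files (one per categorical feature).
gsutil ls "${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_*_vocab"

# The vocabulary size of a feature is its line count, capped at --max_vocab_size.
# Example for a single feature file (adjust the name to one listed above):
gsutil cat "${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_10_vocab" | wc -l
```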
4. (Optional) Re-balance the dataset.

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo_balanced/train/train" \
  --num_output_files 8192 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo_balanced/test/test" \
  --num_output_files 1024 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

At this point the training and test data are in:

* `${STORAGE_BUCKET}/criteo_balanced/train/`
* `${STORAGE_BUCKET}/criteo_balanced/test/`

All other intermediate directories can be removed.
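If you want to reclaim storage, the sketch below shows one way to verify the balanced output and then delete the intermediate copies. It assumes the paths used above, that `gsutil` is available, and that you no longer need the sharded raw data or the unbalanced preprocessed data:

```bash
# Verify the balanced output before deleting anything.
gsutil ls "${STORAGE_BUCKET}/criteo_balanced/train/*" | wc -l   # expect 8192 shards
gsutil ls "${STORAGE_BUCKET}/criteo_balanced/test/*" | wc -l    # expect 1024 shards

# Remove the intermediate copies; keep criteo_raw if you may need to re-run preprocessing.
gsutil -m rm -r "${STORAGE_BUCKET}/criteo_raw_sharded" "${STORAGE_BUCKET}/criteo"
```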