## Download and preprocess Criteo TB dataset

[Apache Beam](https://beam.apache.org) enables distributed preprocessing of the
dataset and can be run on
[Google Cloud Dataflow](https://cloud.google.com/dataflow/). The preprocessing
scripts can also be run locally via the DirectRunner, provided the local host
has enough CPU, memory, and storage.

Install the required packages:

```bash
python3 setup.py install
```


Set up the following environment variables, replacing `bucket-name` with the
name of your Cloud Storage bucket and `my-gcp-project` with your GCP project
name.

```bash
export STORAGE_BUCKET=gs://bucket-name
export PROJECT=my-gcp-project
export REGION=us-central1
```

Note: When running locally, the environment variables above are not needed, and
a local path can be used instead of `gs://bucket-name`. Also consider passing a
smaller `--max_vocab_size` argument.
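
For example, a local vocabulary-generation run might look like the following
(a sketch with hypothetical local paths; the flags mirror the Dataflow
invocations shown later, and the DirectRunner needs no project or region):

```bash
python3 criteo_preprocess.py \
  --input_path "/data/criteo_raw_sharded/*/*" \
  --output_path "/data/criteo/" \
  --temp_dir "/data/criteo_vocab/" \
  --vocab_gen_mode --runner DirectRunner --max_vocab_size 500000
```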


1.  Download the raw
    [Criteo TB dataset](https://labs.criteo.com/2013/12/download-terabyte-click-logs/)
    to a GCS bucket.

Organize the data as follows (an upload example is shown after this list):

*   The files `day_0.gz`, `day_1.gz`, ..., `day_22.gz` in
    `${STORAGE_BUCKET}/criteo_raw/train/`

*   The file `day_23.gz` in `${STORAGE_BUCKET}/criteo_raw/test/`
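
For example, assuming the `day_*.gz` files have already been downloaded to the
local machine, they can be uploaded with `gsutil` (a sketch; adjust the local
paths as needed):

```bash
# Upload the 23 training days and the single test day.
for i in $(seq 0 22); do
  gsutil cp "day_${i}.gz" "${STORAGE_BUCKET}/criteo_raw/train/"
done
gsutil cp "day_23.gz" "${STORAGE_BUCKET}/criteo_raw/test/"
```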

2. Shard the raw training/test data into multiple files.

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/train" \
  --num_output_files 1024 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```


```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/test" \
  --num_output_files 64 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```
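
A quick sanity check, assuming `gsutil` is available, is to count the resulting
shards; the counts should match the `--num_output_files` arguments above:

```bash
gsutil ls "${STORAGE_BUCKET}/criteo_raw_sharded/train/*" | wc -l   # expect 1024
gsutil ls "${STORAGE_BUCKET}/criteo_raw_sharded/test/*" | wc -l    # expect 64
```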

3. Generate vocabulary and preprocess the data.

Generate vocabulary:

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/*/*" \
  --output_path "${STORAGE_BUCKET}/criteo/" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```
A vocabulary for each feature will be written to the
`${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_??_vocab` files. The
size of each vocabulary can be obtained with `wc -l <feature_vocab_file>`.
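
For example, the per-feature vocabulary sizes can be printed with `gsutil`
(a sketch; adjust the glob if the generated file names differ):

```bash
for f in $(gsutil ls "${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_??_vocab"); do
  echo "${f}: $(gsutil cat "${f}" | wc -l) entries"
done
```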

Preprocess training and test data:

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo/train/train" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 criteo_preprocess.py \
  --input_path "${STORAGE_BUCKET}/criteo_raw_sharded/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo/test/test" \
  --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" \
  --runner DataflowRunner --max_vocab_size 5000000 \
  --project ${PROJECT} --region ${REGION}
```


4. (Optional) Re-balance the dataset.

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo/train/*" \
  --output_path "${STORAGE_BUCKET}/criteo_balanced/train/train" \
  --num_output_files 8192 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

```bash
python3 shard_rebalancer.py \
  --input_path "${STORAGE_BUCKET}/criteo/test/*" \
  --output_path "${STORAGE_BUCKET}/criteo_balanced/test/test" \
  --num_output_files 1024 --filetype csv --runner DataflowRunner \
  --project ${PROJECT} --region ${REGION}
```

At this point, the training and test data are in the following directories:

* `${STORAGE_BUCKET}/criteo_balanced/train/`
* `${STORAGE_BUCKET}/criteo_balanced/test/`

All other intermediate directories can be removed.
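
For example, the intermediate data could be deleted with `gsutil` (a sketch,
assuming the optional re-balancing step was run; this is destructive, so
double-check the paths and keep the vocabulary files if they are still needed):

```bash
gsutil -m rm -r \
  "${STORAGE_BUCKET}/criteo_raw_sharded" \
  "${STORAGE_BUCKET}/criteo"
```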