{
"cells": [
{
"cell_type": "raw",
"id": "8f37ab53",
"metadata": {},
"source": [
"---\n",
"title: 19 Feature Tokenizer Transformer\n",
"description: An implementation of Feature Tokenizer Transformer on a classification task\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "09826e04",
"metadata": {
"id": "09826e04",
"papermill": {
"duration": 0.020595,
"end_time": "2022-03-31T14:04:34.631874",
"exception": false,
"start_time": "2022-03-31T14:04:34.611279",
"status": "completed"
},
"tags": []
},
"source": [
"# Feature Tokenizer Transformer\n",
"Featured in the paper [Revisiting Deep Learning Models for Tabular Data (2021, June)](https://arxiv.org/abs/2106.11959) Feature Tokenizer Transformer is a simple adaptation of the Transformer architecture for the tabular domain. In a nutshell, Feature Tokenizer Transformer transforms all features (categorical and numerical) to embeddings and applies a stack of Transformer layers to the embeddings. Thus, every Transformer layer operates on the feature level of one object.\n",
"\n",
"In this notebook we will be implementing Feature Tokenizer Transformer using TensorFlow 2 from scratch."
]
},
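  {
   "cell_type": "markdown",
   "id": "ft-sketch-md",
   "metadata": {},
   "source": [
    "To make the idea concrete before the full implementation, here is a minimal sketch of the tokenization step, assuming `d`-dimensional embeddings: each numerical feature $x_j$ is embedded as $x_j \\cdot W_j + b_j$ with learned $W_j, b_j \\in \\mathbb{R}^d$ (categorical features would use a learned lookup table instead). The class name `SimpleFeatureTokenizer` and its arguments are illustrative assumptions, not an API from the paper or from the implementation developed below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ft-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only -- not the implementation developed below.\n",
    "import tensorflow as tf\n",
    "\n",
    "class SimpleFeatureTokenizer(tf.keras.layers.Layer):\n",
    "    \"\"\"Embeds numerical feature j as x_j * W_j + b_j (learned per feature).\"\"\"\n",
    "\n",
    "    def __init__(self, num_features, embedding_dim):\n",
    "        super().__init__()\n",
    "        # One learned weight vector and bias vector per feature.\n",
    "        self.W = self.add_weight(shape=(num_features, embedding_dim),\n",
    "                                 initializer=\"glorot_uniform\", trainable=True)\n",
    "        self.b = self.add_weight(shape=(num_features, embedding_dim),\n",
    "                                 initializer=\"zeros\", trainable=True)\n",
    "\n",
    "    def call(self, x):\n",
    "        # x: (batch, num_features) -> (batch, num_features, embedding_dim)\n",
    "        return x[..., tf.newaxis] * self.W + self.b\n",
    "\n",
    "# Example: 8 numerical features tokenized into 16-dimensional embeddings.\n",
    "tokens = SimpleFeatureTokenizer(num_features=8, embedding_dim=16)(tf.random.normal((4, 8)))\n",
    "print(tokens.shape)  # (4, 8, 16)"
   ]
  },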
{
"cell_type": "code",
"execution_count": null,
"id": "iW5s96JIyd6-",
"metadata": {
"id": "iW5s96JIyd6-"
},
"outputs": [],
"source": [
"%%capture\n",
"!pip install tensorflow-addons"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "457cf1b0",
"metadata": {
"_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
"_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
"execution": {
"iopub.execute_input": "2022-03-31T14:04:34.678481Z",
"iopub.status.busy": "2022-03-31T14:04:34.676340Z",
"iopub.status.idle": "2022-03-31T14:04:41.559310Z",
"shell.execute_reply": "2022-03-31T14:04:41.558566Z",
"shell.execute_reply.started": "2022-02-09T09:09:07.50688Z"
},
"id": "457cf1b0",
"papermill": {
"duration": 6.907577,
"end_time": "2022-03-31T14:04:41.559509",
"exception": false,
"start_time": "2022-03-31T14:04:34.651932",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers as L\n",
"from tensorflow_addons.activations import sparsemax\n",
"from tensorflow.data import Dataset\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.compose import make_column_selector, ColumnTransformer\n",
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.metrics import confusion_matrix, classification_report\n",
"import joblib\n",
"from tensorflow.keras import utils"
]
},
{
"cell_type": "markdown",
"id": "a01faffd",
"metadata": {
"id": "a01faffd",
"papermill": {
"duration": 0.018805,
"end_time": "2022-03-31T14:04:41.597609",
"exception": false,
"start_time": "2022-03-31T14:04:41.578804",
"status": "completed"
},
"tags": []
},
"source": [
"# Data\n",
"Loading the train and test csv files into `pandas.DataFrame` and splitting the columns as features and target.\n",
"\n",
"We will be using Stratified K folds as our local cross validation."
]
},
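  {
   "cell_type": "markdown",
   "id": "skf-sketch-md",
   "metadata": {},
   "source": [
    "As a quick illustration of how stratified K-fold splitting works (a generic sketch on a made-up toy target, not the notebook's actual data): each fold preserves the class proportions of the target in both the train and validation indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "skf-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generic sketch of stratified K-fold splitting on a toy target\n",
    "# (illustrative; the real data is loaded in the next cell).\n",
    "import numpy as np\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "X_toy = np.arange(20).reshape(10, 2)               # 10 samples, 2 features\n",
    "y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # toy binary target\n",
    "\n",
    "skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n",
    "for fold, (train_idx, valid_idx) in enumerate(skf.split(X_toy, y_toy)):\n",
    "    # Each fold keeps the 50/50 class balance of y_toy in both splits.\n",
    "    print(f\"fold {fold}: train={train_idx}, valid={valid_idx}\")"
   ]
  },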
{
"cell_type": "code",
"execution_count": null,
"id": "5f0e0c77",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 224
},
"execution": {
"iopub.execute_input": "2022-03-31T14:04:41.642070Z",
"iopub.status.busy": "2022-03-31T14:04:41.640798Z",
"iopub.status.idle": "2022-03-31T14:05:13.765418Z",
"shell.execute_reply": "2022-03-31T14:05:13.765978Z",
"shell.execute_reply.started": "2022-02-09T09:12:07.378505Z"
},
"id": "5f0e0c77",
"outputId": "4455f38a-cf9b-43b9-87a9-a2e2c8c2d275",
"papermill": {
"duration": 32.15078,
"end_time": "2022-03-31T14:05:13.766160",
"exception": false,
"start_time": "2022-03-31T14:04:41.615380",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(17000, 9)\n"
]
},
{
"data": {
"text/html": [
"\n",
"
\n", " | longitude | \n", "latitude | \n", "housing_median_age | \n", "total_rooms | \n", "total_bedrooms | \n", "population | \n", "households | \n", "median_income | \n", "median_house_value | \n", "
---|---|---|---|---|---|---|---|---|---|
0 | \n", "-114.31 | \n", "34.19 | \n", "15.0 | \n", "2 | \n", "3 | \n", "1015.0 | \n", "472.0 | \n", "1.4936 | \n", "0.0 | \n", "
1 | \n", "-114.47 | \n", "34.40 | \n", "19.0 | \n", "0 | \n", "1 | \n", "1129.0 | \n", "463.0 | \n", "1.8200 | \n", "0.0 | \n", "
2 | \n", "-114.56 | \n", "33.69 | \n", "17.0 | \n", "0 | \n", "4 | \n", "333.0 | \n", "117.0 | \n", "1.6509 | \n", "0.0 | \n", "
3 | \n", "-114.57 | \n", "33.64 | \n", "14.0 | \n", "1 | \n", "7 | \n", "515.0 | \n", "226.0 | \n", "3.1917 | \n", "0.0 | \n", "
4 | \n", "-114.57 | \n", "33.57 | \n", "20.0 | \n", "4 | \n", "6 | \n", "624.0 | \n", "262.0 | \n", "1.9250 | \n", "0.0 | \n", "