{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "61c88034-1243-4ab7-9d20-96398b0d9efd",
   "metadata": {},
   "source": [
    "# Case Study: Predictive Analytics for E-commerce\n",
    "\n",
    "## Business Context:\n",
    "- You have been hired as a Data Science and AI specialist by an e-commerce company named \"Terra Store.\" Terra Store wants to improve its marketing strategy by predicting customer purchase behavior from historical data. The company wants to build an AI-powered application that shows which products a customer is likely to purchase next.\n",
    "\n",
    "## Problem Statement:\n",
    "- Terra Store has provided you with a dataset containing information about customer interactions, purchases, and product details. Your task is to develop a web-based AI application that predicts the next product a customer is likely to buy. The application should be user-friendly, allowing marketing teams to target customers more effectively."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4f2378e-150a-4caa-9de0-6d4034bded4b",
   "metadata": {},
   "source": [
    "## Data Description:\n",
    "- The dataset includes the following information:\n",
    "- Customer Interactions:\n",
    "   - Customer ID\n",
    "   - Page views\n",
    "   - Time spent on the website\n",
    "- Purchase History:\n",
    "   - Customer ID\n",
    "   - Product ID\n",
    "   - Purchase date\n",
    "- Product Details:\n",
    "   - Product ID\n",
    "   - Category\n",
    "   - Price\n",
    "   - Ratings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "270ab428-5a85-4b96-8fc2-0b5b803c625d",
   "metadata": {},
   "source": [
    "## Step by Step\n",
    "1. Data exploration\n",
    "2. Data preprocessing\n",
    "3. Model Development\n",
    "4. Web Application Development"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0730d462-87a6-4b07-af77-ab4e02e77f9f",
   "metadata": {},
   "source": [
    "## Data exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 444,
   "id": "2c8130e6-7add-4d51-8e5a-aeab1a3bec09",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import random\n",
    "import warnings\n",
    "import joblib\n",
    "import gradio as gr\n",
    "\n",
    "from faker import Faker\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from surprise import KNNWithMeans\n",
    "from surprise.model_selection import train_test_split\n",
    "from surprise.accuracy import rmse\n",
    "\n",
    "from surprise import Dataset\n",
    "from surprise import Reader\n",
    "\n",
    "fake = Faker()\n",
    "warnings.filterwarnings('ignore')\n",
    "pd.set_option(\"display.max_columns\", 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "b4100a7e-59e3-4fe6-a0ac-b3c3e3c69621",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ci = pd.read_csv(\"/Users/Ruangguru/Documents/Project/skilvul/customer_interactions.csv\")\n",
    "df_pd = pd.read_csv(\"/Users/Ruangguru/Documents/Project/skilvul/product_details.csv\", delimiter=';')\n",
    "df_ph = pd.read_csv(\"/Users/Ruangguru/Documents/Project/skilvul/purchase_history.csv\", delimiter=';')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c01b2fbc-683a-4169-8096-d5c2f61564ed",
   "metadata": {},
   "source": [
    "- Because the original dataset is very small, I decided to generate synthetic data to supplement it."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3036a93f-129e-4a08-a446-441e2dcacfbf",
   "metadata": {},
   "source": [
    "### customer interactions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "365d1c9c-069a-476e-84e2-a6cd872ecaca",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>page_views</th>\n",
       "      <th>time_spent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>20</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>30</td>\n",
       "      <td>150</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>22</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   customer_id  page_views  time_spent\n",
       "0            1          25         120\n",
       "1            2          20          90\n",
       "2            3          30         150\n",
       "3            4          15          80\n",
       "4            5          22         110"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ci"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "id": "9d770909-3e13-4e5f-b18b-f6db1438dfb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "len_fake_data_ci = 95\n",
    "df_fake_ci = pd.DataFrame()\n",
    "df_fake_ci = df_fake_ci.assign(page_views = pd.Series(fake.random.randint(1, 35) for i in range(len_fake_data_ci)),\n",
    "                               time_spent = pd.Series(fake.random.randint(50, 300) for i in range(len_fake_data_ci)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "id": "d4944ac3-326b-4339-ab4b-e5c0ac555a61",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_fake_ci = df_fake_ci.reset_index().rename(columns={\n",
    "    'index' : 'customer_id'\n",
    "})\n",
    "\n",
    "# The original data uses customer IDs 1-5, so the synthetic IDs start at 6\n",
    "df_fake_ci['customer_id'] = df_fake_ci['customer_id'] + 6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 173,
   "id": "4370a491-b716-41ce-9be6-941fc7aa16a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ci_full = pd.concat([df_ci, df_fake_ci], ignore_index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b43983c-1c6a-4bc5-8d86-0744fed3fed1",
   "metadata": {},
   "source": [
    "### product details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ea1710f6-b1cc-485e-8e71-de9462cd9b13",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>category</th>\n",
       "      <th>price</th>\n",
       "      <th>ratings</th>\n",
       "      <th>Unnamed: 4</th>\n",
       "      <th>Unnamed: 5</th>\n",
       "      <th>Unnamed: 6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>101</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>102</td>\n",
       "      <td>Clothing</td>\n",
       "      <td>50</td>\n",
       "      <td>3.8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>103</td>\n",
       "      <td>Home &amp; Kitchen</td>\n",
       "      <td>200</td>\n",
       "      <td>4.2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>104</td>\n",
       "      <td>Beauty</td>\n",
       "      <td>30</td>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>105</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>800</td>\n",
       "      <td>4.8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   product_id        category  price  ratings  Unnamed: 4  Unnamed: 5  \\\n",
       "0         101     Electronics    500      4.5         NaN         NaN   \n",
       "1         102        Clothing     50      3.8         NaN         NaN   \n",
       "2         103  Home & Kitchen    200      4.2         NaN         NaN   \n",
       "3         104          Beauty     30      4.0         NaN         NaN   \n",
       "4         105     Electronics    800      4.8         NaN         NaN   \n",
       "\n",
       "   Unnamed: 6  \n",
       "0         NaN  \n",
       "1         NaN  \n",
       "2         NaN  \n",
       "3         NaN  \n",
       "4         NaN  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 196,
   "id": "5af8face-680b-443f-91c3-bafe78e322a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "len_fake_data_pd = 20\n",
    "df_fake_pd = pd.DataFrame()\n",
    "# Every column uses len_fake_data_pd; a longer column would add rows with a NaN category\n",
    "df_fake_pd = df_fake_pd.assign(category = pd.Series(random.choice(['Electronics', 'Clothing', 'Home & Kitchen', 'Beauty']) for i in range(len_fake_data_pd)),\n",
    "                               price = pd.Series(fake.random.randint(30, 1000) for i in range(len_fake_data_pd)),\n",
    "                               ratings = pd.Series(round(fake.random.uniform(1, 5), 2) for i in range(len_fake_data_pd)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 197,
   "id": "c2d9c9cf-096f-4eeb-a5b1-30c02d43655b",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_fake_pd = df_fake_pd.reset_index().rename(columns={\n",
    "    'index' : 'product_id'\n",
    "})\n",
    "\n",
    "# The original data uses product IDs 101-105, so the synthetic IDs start at 106\n",
    "df_fake_pd['product_id'] = df_fake_pd['product_id'] + 106"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 199,
   "id": "8915489e-fab3-4582-9c4f-2d9e5d5037f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_pd_full = pd.concat([df_pd, df_fake_pd], ignore_index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d5c54e3-a9fb-46ce-8614-b33093075196",
   "metadata": {},
   "source": [
    "### purchase history"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 246,
   "id": "3e8adc2c-12f4-4de1-a329-e6e6500577ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "len_fake_data_ph = 994\n",
    "df_fake_ph = pd.DataFrame()\n",
    "df_fake_ph = df_fake_ph.assign(customer_id = pd.Series(fake.random.randint(1, 100) for i in range(len_fake_data_ph)),\n",
    "                               product_id = pd.Series(fake.random.randint(101, 125) for i in range(len_fake_data_ph)),\n",
    "                               purchase_date = pd.Series(fake.date_between_dates(pd.to_datetime('2023-01-01'), pd.to_datetime('2023-01-06')) for i in range(len_fake_data_ph)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 247,
   "id": "5e6089c9-16c9-4f72-b722-fb9f96f55af4",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ph_full = pd.concat([df_ph, df_fake_ph], ignore_index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "632e3b2b-f489-43f4-8b08-4853745bd259",
   "metadata": {},
   "source": [
    "### final data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 248,
   "id": "fb959b97-8ddf-4b99-a985-65096de6e019",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ph_full = pd.merge(df_ph_full, df_ci_full, on=['customer_id']).drop(columns=['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'])\n",
    "df_ph_final = pd.merge(df_ph_full, df_pd_full, on=['product_id']).drop(columns=['Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 249,
   "id": "be4f2ee4-4dfa-4feb-959c-700861e4725d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ph_final['purchase_date'] = pd.to_datetime(df_ph_final['purchase_date'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 250,
   "id": "3e2980ea-13c3-45db-a862-2b6b0dd2793e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>product_id</th>\n",
       "      <th>purchase_date</th>\n",
       "      <th>page_views</th>\n",
       "      <th>time_spent</th>\n",
       "      <th>category</th>\n",
       "      <th>price</th>\n",
       "      <th>ratings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>25</td>\n",
       "      <td>120</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>30</td>\n",
       "      <td>150</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-03</td>\n",
       "      <td>30</td>\n",
       "      <td>150</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-02</td>\n",
       "      <td>15</td>\n",
       "      <td>80</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-05</td>\n",
       "      <td>22</td>\n",
       "      <td>110</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   customer_id  product_id purchase_date  page_views  time_spent     category  \\\n",
       "0            1         101    2023-01-01          25         120  Electronics   \n",
       "1            3         101    2023-01-01          30         150  Electronics   \n",
       "2            3         101    2023-01-03          30         150  Electronics   \n",
       "3            4         101    2023-01-02          15          80  Electronics   \n",
       "4            5         101    2023-01-05          22         110  Electronics   \n",
       "\n",
       "   price  ratings  \n",
       "0    500      4.5  \n",
       "1    500      4.5  \n",
       "2    500      4.5  \n",
       "3    500      4.5  \n",
       "4    500      4.5  "
      ]
     },
     "execution_count": 250,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7a3bd03-0cc0-4436-9c38-83f773955fa8",
   "metadata": {},
   "source": [
    "- This is the final dataset that I will use to build the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 251,
   "id": "ead8f76e-0c6b-48d1-a1d4-8516da1d0bb4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1000, 8)"
      ]
     },
     "execution_count": 251,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bd56d0f-8cc3-4019-b538-b53b6b595b6a",
   "metadata": {},
   "source": [
    "- The final dataset has 1,000 rows and 8 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 252,
   "id": "57449465-a0e0-4cff-9bd1-ee0f4de35087",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>product_id</th>\n",
       "      <th>purchase_date</th>\n",
       "      <th>page_views</th>\n",
       "      <th>time_spent</th>\n",
       "      <th>price</th>\n",
       "      <th>ratings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1000.000000</td>\n",
       "      <td>1000.00000</td>\n",
       "      <td>1000</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>1000.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>49.077000</td>\n",
       "      <td>112.73200</td>\n",
       "      <td>2023-01-02 22:45:07.200000</td>\n",
       "      <td>19.996000</td>\n",
       "      <td>163.470000</td>\n",
       "      <td>469.828000</td>\n",
       "      <td>3.03613</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>101.00000</td>\n",
       "      <td>2023-01-01 00:00:00</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>50.000000</td>\n",
       "      <td>30.000000</td>\n",
       "      <td>1.06000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>24.000000</td>\n",
       "      <td>106.00000</td>\n",
       "      <td>2023-01-02 00:00:00</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>98.000000</td>\n",
       "      <td>146.000000</td>\n",
       "      <td>1.48000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>49.000000</td>\n",
       "      <td>113.00000</td>\n",
       "      <td>2023-01-03 00:00:00</td>\n",
       "      <td>22.000000</td>\n",
       "      <td>158.000000</td>\n",
       "      <td>480.000000</td>\n",
       "      <td>3.23500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>74.000000</td>\n",
       "      <td>119.00000</td>\n",
       "      <td>2023-01-04 00:00:00</td>\n",
       "      <td>28.000000</td>\n",
       "      <td>226.000000</td>\n",
       "      <td>774.000000</td>\n",
       "      <td>4.20000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>100.000000</td>\n",
       "      <td>125.00000</td>\n",
       "      <td>2023-01-05 00:00:00</td>\n",
       "      <td>35.000000</td>\n",
       "      <td>298.000000</td>\n",
       "      <td>962.000000</td>\n",
       "      <td>4.92000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>29.202463</td>\n",
       "      <td>7.37296</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9.906469</td>\n",
       "      <td>73.424575</td>\n",
       "      <td>296.960088</td>\n",
       "      <td>1.36245</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       customer_id  product_id               purchase_date   page_views  \\\n",
       "count  1000.000000  1000.00000                        1000  1000.000000   \n",
       "mean     49.077000   112.73200  2023-01-02 22:45:07.200000    19.996000   \n",
       "min       1.000000   101.00000         2023-01-01 00:00:00     2.000000   \n",
       "25%      24.000000   106.00000         2023-01-02 00:00:00    11.000000   \n",
       "50%      49.000000   113.00000         2023-01-03 00:00:00    22.000000   \n",
       "75%      74.000000   119.00000         2023-01-04 00:00:00    28.000000   \n",
       "max     100.000000   125.00000         2023-01-05 00:00:00    35.000000   \n",
       "std      29.202463     7.37296                         NaN     9.906469   \n",
       "\n",
       "        time_spent        price     ratings  \n",
       "count  1000.000000  1000.000000  1000.00000  \n",
       "mean    163.470000   469.828000     3.03613  \n",
       "min      50.000000    30.000000     1.06000  \n",
       "25%      98.000000   146.000000     1.48000  \n",
       "50%     158.000000   480.000000     3.23500  \n",
       "75%     226.000000   774.000000     4.20000  \n",
       "max     298.000000   962.000000     4.92000  \n",
       "std      73.424575   296.960088     1.36245  "
      ]
     },
     "execution_count": 252,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 253,
   "id": "04bb7bda-847d-41db-8b2b-490d62401368",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "customer_id\n",
       "3     21\n",
       "49    17\n",
       "70    16\n",
       "38    15\n",
       "16    15\n",
       "82    15\n",
       "84    14\n",
       "23    14\n",
       "18    14\n",
       "63    14\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 253,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final['customer_id'].value_counts().head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 254,
   "id": "071c9a8b-dd76-46f7-9ace-4af99b5bdc51",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "product_id\n",
       "103    50\n",
       "105    49\n",
       "102    48\n",
       "121    46\n",
       "101    44\n",
       "118    43\n",
       "108    43\n",
       "117    42\n",
       "115    42\n",
       "107    41\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 254,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final['product_id'].value_counts().head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 255,
   "id": "7f278206-cc47-4760-bef2-2f510df2ef4b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.countplot(x=df_ph_final['category']);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 256,
   "id": "c34e8603-d749-4a84-8d68-8712d0824a3f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Axes: xlabel='time_spent', ylabel='Count'>"
      ]
     },
     "execution_count": 256,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.histplot(df_ph_final['time_spent'], bins=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 257,
   "id": "82aa5fbf-4ca3-44e7-8e95-672a361652eb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Axes: xlabel='page_views', ylabel='Count'>"
      ]
     },
     "execution_count": 257,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.histplot(df_ph_final['page_views'], bins=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cfe0417-c13e-4a7d-889d-39e9bc50e75c",
   "metadata": {},
   "source": [
    "- The plots above show the distributions of the category, time spent, and page views variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 258,
   "id": "ae2bcbc4-853b-4098-af03-27a171ea1da2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">price</th>\n",
       "      <th colspan=\"2\" halign=\"left\">page_views</th>\n",
       "      <th colspan=\"2\" halign=\"left\">time_spent</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>customer_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6756</td>\n",
       "      <td>519.692308</td>\n",
       "      <td>325</td>\n",
       "      <td>25.0</td>\n",
       "      <td>1560</td>\n",
       "      <td>120.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4098</td>\n",
       "      <td>372.545455</td>\n",
       "      <td>220</td>\n",
       "      <td>20.0</td>\n",
       "      <td>990</td>\n",
       "      <td>90.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>11587</td>\n",
       "      <td>551.761905</td>\n",
       "      <td>630</td>\n",
       "      <td>30.0</td>\n",
       "      <td>3150</td>\n",
       "      <td>150.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4542</td>\n",
       "      <td>324.428571</td>\n",
       "      <td>210</td>\n",
       "      <td>15.0</td>\n",
       "      <td>1120</td>\n",
       "      <td>80.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5715</td>\n",
       "      <td>519.545455</td>\n",
       "      <td>242</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1210</td>\n",
       "      <td>110.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>5888</td>\n",
       "      <td>490.666667</td>\n",
       "      <td>144</td>\n",
       "      <td>12.0</td>\n",
       "      <td>3360</td>\n",
       "      <td>280.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>988</td>\n",
       "      <td>197.600000</td>\n",
       "      <td>125</td>\n",
       "      <td>25.0</td>\n",
       "      <td>680</td>\n",
       "      <td>136.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>4879</td>\n",
       "      <td>443.545455</td>\n",
       "      <td>242</td>\n",
       "      <td>22.0</td>\n",
       "      <td>572</td>\n",
       "      <td>52.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>3547</td>\n",
       "      <td>394.111111</td>\n",
       "      <td>117</td>\n",
       "      <td>13.0</td>\n",
       "      <td>1629</td>\n",
       "      <td>181.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>5552</td>\n",
       "      <td>504.727273</td>\n",
       "      <td>319</td>\n",
       "      <td>29.0</td>\n",
       "      <td>1133</td>\n",
       "      <td>103.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             price             page_views       time_spent       \n",
       "               sum        mean        sum  mean        sum   mean\n",
       "customer_id                                                      \n",
       "1             6756  519.692308        325  25.0       1560  120.0\n",
       "2             4098  372.545455        220  20.0        990   90.0\n",
       "3            11587  551.761905        630  30.0       3150  150.0\n",
       "4             4542  324.428571        210  15.0       1120   80.0\n",
       "5             5715  519.545455        242  22.0       1210  110.0\n",
       "...            ...         ...        ...   ...        ...    ...\n",
       "96            5888  490.666667        144  12.0       3360  280.0\n",
       "97             988  197.600000        125  25.0        680  136.0\n",
       "98            4879  443.545455        242  22.0        572   52.0\n",
       "99            3547  394.111111        117  13.0       1629  181.0\n",
       "100           5552  504.727273        319  29.0       1133  103.0\n",
       "\n",
       "[100 rows x 6 columns]"
      ]
     },
     "execution_count": 258,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final.groupby(['customer_id']).agg({\n",
    "    'price' : ['sum', 'mean'],\n",
    "    'page_views' : ['sum', 'mean'],\n",
    "    'time_spent': ['sum', 'mean']\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 259,
   "id": "0e0db145-c718-4d9d-8034-5519f1756c1b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "purchase_date\n",
       "2023-01-02    220\n",
       "2023-01-03    208\n",
       "2023-01-01    199\n",
       "2023-01-05    193\n",
       "2023-01-04    180\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 259,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final['purchase_date'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db2d93bb-ae4c-47a9-82ad-d43bf81f85ed",
   "metadata": {},
   "source": [
    "## Data Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 260,
   "id": "7b4c2115-bcfd-4b3f-919c-cc3a95471326",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>product_id</th>\n",
       "      <th>purchase_date</th>\n",
       "      <th>page_views</th>\n",
       "      <th>time_spent</th>\n",
       "      <th>category</th>\n",
       "      <th>price</th>\n",
       "      <th>ratings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>25</td>\n",
       "      <td>120</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>30</td>\n",
       "      <td>150</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-03</td>\n",
       "      <td>30</td>\n",
       "      <td>150</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-02</td>\n",
       "      <td>15</td>\n",
       "      <td>80</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-05</td>\n",
       "      <td>22</td>\n",
       "      <td>110</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>21</td>\n",
       "      <td>106</td>\n",
       "      <td>2023-01-05</td>\n",
       "      <td>26</td>\n",
       "      <td>100</td>\n",
       "      <td>Home &amp; Kitchen</td>\n",
       "      <td>678</td>\n",
       "      <td>2.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>17</td>\n",
       "      <td>106</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>16</td>\n",
       "      <td>280</td>\n",
       "      <td>Home &amp; Kitchen</td>\n",
       "      <td>678</td>\n",
       "      <td>2.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>43</td>\n",
       "      <td>106</td>\n",
       "      <td>2023-01-05</td>\n",
       "      <td>6</td>\n",
       "      <td>217</td>\n",
       "      <td>Home &amp; Kitchen</td>\n",
       "      <td>678</td>\n",
       "      <td>2.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>91</td>\n",
       "      <td>106</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>29</td>\n",
       "      <td>234</td>\n",
       "      <td>Home &amp; Kitchen</td>\n",
       "      <td>678</td>\n",
       "      <td>2.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>90</td>\n",
       "      <td>106</td>\n",
       "      <td>2023-01-04</td>\n",
       "      <td>11</td>\n",
       "      <td>213</td>\n",
       "      <td>Home &amp; Kitchen</td>\n",
       "      <td>678</td>\n",
       "      <td>2.74</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     customer_id  product_id purchase_date  page_views  time_spent  \\\n",
       "0              1         101    2023-01-01          25         120   \n",
       "1              3         101    2023-01-01          30         150   \n",
       "2              3         101    2023-01-03          30         150   \n",
       "3              4         101    2023-01-02          15          80   \n",
       "4              5         101    2023-01-05          22         110   \n",
       "..           ...         ...           ...         ...         ...   \n",
       "995           21         106    2023-01-05          26         100   \n",
       "996           17         106    2023-01-01          16         280   \n",
       "997           43         106    2023-01-05           6         217   \n",
       "998           91         106    2023-01-01          29         234   \n",
       "999           90         106    2023-01-04          11         213   \n",
       "\n",
       "           category  price  ratings  \n",
       "0       Electronics    500     4.50  \n",
       "1       Electronics    500     4.50  \n",
       "2       Electronics    500     4.50  \n",
       "3       Electronics    500     4.50  \n",
       "4       Electronics    500     4.50  \n",
       "..              ...    ...      ...  \n",
       "995  Home & Kitchen    678     2.74  \n",
       "996  Home & Kitchen    678     2.74  \n",
       "997  Home & Kitchen    678     2.74  \n",
       "998  Home & Kitchen    678     2.74  \n",
       "999  Home & Kitchen    678     2.74  \n",
       "\n",
       "[1000 rows x 8 columns]"
      ]
     },
     "execution_count": 260,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba9ccc12-53fd-465b-8f29-e4bd9d3be185",
   "metadata": {},
   "source": [
    "- The data looks clean, so no additional preprocessing is needed."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c446b5de-46b1-4ccf-91ce-417fe350b48e",
   "metadata": {},
   "source": [
    "## Model Development\n",
    "- The model is `KNNWithMeans` from scikit-surprise.\n",
    "- It finds each customer's nearest neighbors and predicts the rating that customer would give a product, which serves as a proxy for preference.\n",
    "- The goal is to predict, for each customer, the top 5 products with the highest predicted ratings.\n",
    "- The working hypothesis is that a higher predicted rating means a higher probability that the customer will buy the product."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 373,
   "id": "cfbffc6d-7c44-4449-9d95-cad2db7104a4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>product_id</th>\n",
       "      <th>purchase_date</th>\n",
       "      <th>page_views</th>\n",
       "      <th>time_spent</th>\n",
       "      <th>category</th>\n",
       "      <th>price</th>\n",
       "      <th>ratings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>25</td>\n",
       "      <td>120</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-01</td>\n",
       "      <td>30</td>\n",
       "      <td>150</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-03</td>\n",
       "      <td>30</td>\n",
       "      <td>150</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-02</td>\n",
       "      <td>15</td>\n",
       "      <td>80</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>101</td>\n",
       "      <td>2023-01-05</td>\n",
       "      <td>22</td>\n",
       "      <td>110</td>\n",
       "      <td>Electronics</td>\n",
       "      <td>500</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   customer_id  product_id purchase_date  page_views  time_spent     category  \\\n",
       "0            1         101    2023-01-01          25         120  Electronics   \n",
       "1            3         101    2023-01-01          30         150  Electronics   \n",
       "2            3         101    2023-01-03          30         150  Electronics   \n",
       "3            4         101    2023-01-02          15          80  Electronics   \n",
       "4            5         101    2023-01-05          22         110  Electronics   \n",
       "\n",
       "   price  ratings  \n",
       "0    500      4.5  \n",
       "1    500      4.5  \n",
       "2    500      4.5  \n",
       "3    500      4.5  \n",
       "4    500      4.5  "
      ]
     },
     "execution_count": 373,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 414,
   "id": "8734beed-20df-49e9-ac21-754adfc1639c",
   "metadata": {},
   "outputs": [],
   "source": [
    "reader = Reader(rating_scale=(1, 5))\n",
    "\n",
    "# Loads Pandas dataframe\n",
    "data = Dataset.load_from_df(df_ph_final[[\"customer_id\", \"product_id\", \"ratings\"]], reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3727fb26-8d6d-4438-a9e9-eff23df5220c",
   "metadata": {},
   "source": [
    "- Create the Surprise dataset from the `customer_id`, `product_id`, and `ratings` columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 415,
   "id": "6c266fed-6d16-4633-b002-ca1f13bda70a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sim_options = {\n",
    "    \"name\": \"cosine\",\n",
    "    \"user_based\": True,\n",
    "}\n",
    "algo = KNNWithMeans(sim_options=sim_options)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3d689ac-1d96-4a7c-8889-9592c6ef1a13",
   "metadata": {},
   "source": [
    "- Define the model: user-based `KNNWithMeans` with cosine similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 416,
   "id": "f1f15533-473d-4571-80e6-f54a51ccd85c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing the cosine similarity matrix...\n",
      "Done computing similarity matrix.\n"
     ]
    }
   ],
   "source": [
    "trainset, testset = train_test_split(data, test_size=0.25)\n",
    "algo.fit(trainset)\n",
    "predictions = algo.test(testset)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61942c65-fefc-4cb2-9532-714058513884",
   "metadata": {},
   "source": [
    "- Split the data into training (75%) and test (25%) sets, fit the model on the training set, and predict on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 417,
   "id": "a7b4cf3e-d850-4dc8-b5bf-7841586b0816",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE: 0.4993\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.4992596085741228"
      ]
     },
     "execution_count": 417,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rmse(predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65e3d046-f24d-4adb-8a91-8d90c18e4b94",
   "metadata": {},
   "source": [
    "- RMSE is used to evaluate the model. The test RMSE is about 0.4993 on the 1 to 5 rating scale, which seems good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 418,
   "id": "80bd5e5e-1945-4f81-8fe5-9d2d65c189c2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['KNN_Model.joblib']"
      ]
     },
     "execution_count": 418,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joblib.dump(algo, \"KNN_Model.joblib\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fbf727c-ebd9-4476-afbc-f5a7c5a14aaa",
   "metadata": {},
   "source": [
    "- Save the trained model to `KNN_Model.joblib`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b5c8a49-fce3-42a3-a342-7d3ef3df3954",
   "metadata": {},
   "source": [
    "### Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 420,
   "id": "b390e5a6-01ae-46c4-8df0-493f75fe2b02",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict a rating for customer 1 on every product in the data\n",
    "list_predicted = []\n",
    "\n",
    "for pid in df_ph_final['product_id'].unique():\n",
    "    # algo.predict returns a Prediction namedtuple (uid, iid, r_ui, est, details)\n",
    "    pred = algo.predict(1, pid)\n",
    "\n",
    "    # Keep the product ID and its estimated rating\n",
    "    list_predicted.append((pred.iid, pred.est))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 425,
   "id": "423ab39c-9232-4ca5-a395-1d5366e6096b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([101, 105, 120, 110, 102, 115, 103, 103, 122, 114, 114, 124, 121])"
      ]
     },
     "execution_count": 425,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ph_final[df_ph_final['customer_id']==1]['product_id'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 430,
   "id": "c3e97249-37b0-4f43-9849-aa30df423169",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[105, 121, 118, 114, 101]"
      ]
     },
     "execution_count": 430,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top_5_products = sorted(list_predicted, key=lambda x:x[1], reverse=True)[:5]\n",
    "top_5_products = [product[0] for product in top_5_products]\n",
    "top_5_products"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0159ed80-6adc-4238-af18-fc1701da965c",
   "metadata": {},
   "source": [
    "- Based on the historical data, the model predicts that the products customer 1 is most likely to buy next are 105, 121, 118, 114, and 101."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de0860a0-47ed-432b-ba09-26bd41e85d7c",
   "metadata": {},
   "source": [
    "## Web Application Development\n",
    "- The app is built with Gradio and hosted on Hugging Face Spaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 447,
   "id": "f73f026e-8692-48b6-af28-411439bac649",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ph_final.to_csv(\"data_final.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 437,
   "id": "f9db4891-7e9c-406f-9487-faf2ccd8c1b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def product_recommender(customer_id):\n",
    "    # Predict a rating for every product and keep (product ID, estimated rating)\n",
    "    list_predicted = []\n",
    "    for pid in df_ph_final['product_id'].unique():\n",
    "        pred = algo.predict(customer_id, pid)\n",
    "        list_predicted.append((pred.iid, pred.est))\n",
    "\n",
    "    # Take the 5 products with the highest estimated ratings\n",
    "    top_5_products_raw = sorted(list_predicted, key=lambda x: x[1], reverse=True)[:5]\n",
    "    top_5_products = [product[0] for product in top_5_products_raw]\n",
    "\n",
    "    # Look up each product's category and format one message per recommendation\n",
    "    results = []\n",
    "    for pid in top_5_products:\n",
    "        category = df_ph_final[df_ph_final['product_id'] == pid]['category'].values[0]\n",
    "        results.append(f\"Recommendation Product ID {pid} with Category {category}\")\n",
    "\n",
    "    # One string per output textbox\n",
    "    return tuple(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 442,
   "id": "1ea72b31-8ca8-46f1-9833-d33cb4cc8ae7",
   "metadata": {},
   "outputs": [],
   "source": [
    "demo = gr.Interface(\n",
    "    title=\"Product Recommendation System\",\n",
    "    description=\"\"\"This interface uses machine learning to\n",
    "                predict the top 5 products a customer is likely to buy in their next purchase.\n",
    "                Enter a Customer ID and the recommendations will appear.\"\"\",\n",
    "    fn=product_recommender,\n",
    "    inputs=[\n",
    "        gr.Number(label=\"Input Customer ID\")\n",
    "    ],\n",
    "    outputs=[\n",
    "        gr.Textbox(label=\"Recommendation Product 1\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 2\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 3\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 4\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 5\")\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 443,
   "id": "3d5921b0-816f-4d59-acc2-63e0be081f9f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running on local URL:  http://127.0.0.1:7862\n",
      "\n",
      "To create a public link, set `share=True` in `launch()`.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div><iframe src=\"http://127.0.0.1:7862/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 443,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "demo.launch()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "192d7c24-be73-48a7-8bc0-eb1bc49d3e54",
   "metadata": {},
   "source": [
    "## Scripts for Deployment in Huggingface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 445,
   "id": "3d4df15e-88f4-4854-bc2c-8267211cd6eb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing requirements.txt\n"
     ]
    }
   ],
   "source": [
    "%%writefile requirements.txt\n",
    "gradio\n",
    "pandas\n",
    "numpy\n",
    "faker\n",
    "scikit-surprise"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 449,
   "id": "408aac2a-9b5c-4797-bd00-7706007f81c1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing app.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile app.py\n",
    "import gradio as gr\n",
    "import pandas as pd\n",
    "import joblib\n",
    "\n",
    "# Load the purchase data and the trained model saved earlier as KNN_Model.joblib\n",
    "data = pd.read_csv(\"data_final.csv\")\n",
    "algo = joblib.load(\"KNN_Model.joblib\")\n",
    "\n",
    "def product_recommender(customer_id):\n",
    "    # Predict a rating for every product and keep (product ID, estimated rating)\n",
    "    list_predicted = []\n",
    "    for pid in data['product_id'].unique():\n",
    "        pred = algo.predict(customer_id, pid)\n",
    "        list_predicted.append((pred.iid, pred.est))\n",
    "\n",
    "    # Take the 5 products with the highest estimated ratings\n",
    "    top_5_products_raw = sorted(list_predicted, key=lambda x: x[1], reverse=True)[:5]\n",
    "    top_5_products = [product[0] for product in top_5_products_raw]\n",
    "\n",
    "    # Look up each product's category and format one message per recommendation\n",
    "    results = []\n",
    "    for pid in top_5_products:\n",
    "        category = data[data['product_id'] == pid]['category'].values[0]\n",
    "        results.append(f\"Recommendation Product ID {pid} with Category {category}\")\n",
    "\n",
    "    # One string per output textbox\n",
    "    return tuple(results)\n",
    "\n",
    "demo = gr.Interface(\n",
    "    title=\"Product Recommendation System\",\n",
    "    description=\"\"\"This interface uses machine learning to\n",
    "                predict the top 5 products a customer is likely to buy in their next purchase.\n",
    "                Enter a Customer ID and the recommendations will appear.\"\"\",\n",
    "    fn=product_recommender,\n",
    "    inputs=[\n",
    "        gr.Number(label=\"Input Customer ID\")\n",
    "    ],\n",
    "    outputs=[\n",
    "        gr.Textbox(label=\"Recommendation Product 1\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 2\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 3\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 4\"),\n",
    "        gr.Textbox(label=\"Recommendation Product 5\")\n",
    "    ]\n",
    ")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    demo.launch()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50dee250-e65d-4ae3-9263-cfacc09ac6ff",
   "metadata": {},
   "source": [
    "- Web app: https://huggingface.co/spaces/Adipta/product-recommender\n",
    "- Repository: https://huggingface.co/spaces/Adipta/product-recommender/tree/main"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61ac0db1-1c8e-447f-8cb2-70f679d61a59",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}