{ "cells": [ { "cell_type": "markdown", "id": "61c88034-1243-4ab7-9d20-96398b0d9efd", "metadata": {}, "source": [ "# Case Study: Predictive Analytics for E-commerce\n", "\n", "## Business Context:\n", "- You are hired as a Data Science and AI specialist for an e-commerce company named \"Terra Store.\" Terra Store is looking to enhance its marketing strategy by predicting customer purchase behavior based on historical data. The company wants to build an AI-powered application that can provide insights into which products a customer is likely to purchase next.\n", "\n", "## Problem Statement:\n", "- Terra Store has provided you with a dataset containing information about customer interactions, purchases, and product details. Your task is to develop a web-based AI application that predicts the next product a customer is likely to buy. The application should be user-friendly, allowing marketing teams to target customers more effectively." ] }, { "cell_type": "markdown", "id": "f4f2378e-150a-4caa-9de0-6d4034bded4b", "metadata": {}, "source": [ "## Data Description:\n", "- The dataset includes the following information:\n", "- Customer Interactions:\n", " - Customer ID\n", " - Page views\n", " - Time spent on the website\n", "- Purchase History:\n", " - Customer ID\n", " - Product ID\n", " - Purchase date\n", "- Product Details:\n", " - Product ID\n", " - Category\n", " - Price\n", " - Ratings" ] }, { "cell_type": "markdown", "id": "270ab428-5a85-4b96-8fc2-0b5b803c625d", "metadata": {}, "source": [ "## Step by Step\n", "1. Data exploration\n", "2. Data preprocessing\n", "3. Model Development\n", "4. 
Web Application Development" ] }, { "cell_type": "markdown", "id": "0730d462-87a6-4b07-af77-ab4e02e77f9f", "metadata": {}, "source": [ "## Data exploration" ] }, { "cell_type": "code", "execution_count": 444, "id": "2c8130e6-7add-4d51-8e5a-aeab1a3bec09", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import random\n", "import warnings\n", "import joblib\n", "import gradio as gr\n", "\n", "from faker import Faker\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from surprise import KNNWithMeans\n", "from surprise.model_selection import train_test_split\n", "from surprise.accuracy import rmse\n", "\n", "from surprise import Dataset\n", "from surprise import Reader\n", "\n", "fake = Faker()\n", "warnings.filterwarnings('ignore')\n", "pd.set_option(\"display.max_columns\", 100)" ] }, { "cell_type": "code", "execution_count": 11, "id": "b4100a7e-59e3-4fe6-a0ac-b3c3e3c69621", "metadata": {}, "outputs": [], "source": [ "df_ci = pd.read_csv(\"/Users/Ruangguru/Documents/Project/skilvul/customer_interactions.csv\")\n", "df_pd = pd.read_csv(\"/Users/Ruangguru/Documents/Project/skilvul/product_details.csv\", delimiter=';')\n", "df_ph = pd.read_csv(\"/Users/Ruangguru/Documents/Project/skilvul/purchase_history.csv\", delimiter=';')" ] }, { "cell_type": "markdown", "id": "c01b2fbc-683a-4169-8096-d5c2f61564ed", "metadata": {}, "source": [ "- Because the provided data is very small, I decided to generate synthetic data to augment it" ] }, { "cell_type": "markdown", "id": "3036a93f-129e-4a08-a446-441e2dcacfbf", "metadata": {}, "source": [ "### customer interactions" ] }, { "cell_type": "code", "execution_count": 12, "id": "365d1c9c-069a-476e-84e2-a6cd872ecaca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customer_id page_views time_spent\n", "0 1 25 120\n", "1 2 20 90\n", "2 3 30 150\n", "3 4 15 80\n", "4 5 22 110" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ci" ] }, { "cell_type": "code", "execution_count": 169, "id": "9d770909-3e13-4e5f-b18b-f6db1438dfb7", "metadata": {}, "outputs": [], "source": [ "len_fake_data_ci = 95\n", "df_fake_ci = pd.DataFrame()\n", "df_fake_ci = df_fake_ci.assign(page_views = pd.Series(fake.random.randint(1, 35) for i in range(len_fake_data_ci)),\n", " time_spent = pd.Series(fake.random.randint(50, 300) for i in range(len_fake_data_ci)))" ] }, { "cell_type": "code", "execution_count": 170, "id": "d4944ac3-326b-4339-ab4b-e5c0ac555a61", "metadata": {}, "outputs": [], "source": [ "df_fake_ci = df_fake_ci.reset_index().rename(columns={\n", " 'index' : 'customer_id'\n", "})\n", "\n", "df_fake_ci['customer_id'] = df_fake_ci['customer_id'] + 6" ] }, { "cell_type": "code", "execution_count": 173, "id": "4370a491-b716-41ce-9be6-941fc7aa16a3", "metadata": {}, "outputs": [], "source": [ "df_ci_full = pd.concat([df_ci, df_fake_ci], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "4b43983c-1c6a-4bc5-8d86-0744fed3fed1", "metadata": {}, "source": [ "### product details" ] }, { "cell_type": "code", "execution_count": 13, "id": "ea1710f6-b1cc-485e-8e71-de9462cd9b13", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " product_id category price ratings Unnamed: 4 Unnamed: 5 \\\n", "0 101 Electronics 500 4.5 NaN NaN \n", "1 102 Clothing 50 3.8 NaN NaN \n", "2 103 Home & Kitchen 200 4.2 NaN NaN \n", "3 104 Beauty 30 4.0 NaN NaN \n", "4 105 Electronics 800 4.8 NaN NaN \n", "\n", " Unnamed: 6 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pd" ] }, { "cell_type": "code", "execution_count": 196, "id": "5af8face-680b-443f-91c3-bafe78e322a8", "metadata": {}, "outputs": [], "source": [ "# Generate 20 synthetic products; every column uses the same length (len_fake_data_pd)\n", "len_fake_data_pd = 20\n", "df_fake_pd = pd.DataFrame()\n", "df_fake_pd = df_fake_pd.assign(category = pd.Series(random.choice(['Electronics', 'Clothing', 'Home & Kitchen', 'Beauty']) for i in range(len_fake_data_pd)),\n", " price = pd.Series(fake.random.randint(30, 1000) for i in range(len_fake_data_pd)),\n", " ratings = pd.Series(round(fake.random.uniform(1, 5), 2) for i in range(len_fake_data_pd)))" ] }, { "cell_type": "code", "execution_count": 197, "id": "c2d9c9cf-096f-4eeb-a5b1-30c02d43655b", "metadata": {}, "outputs": [], "source": [ "df_fake_pd = df_fake_pd.reset_index().rename(columns={\n", " 'index' : 'product_id'\n", "})\n", "\n", "df_fake_pd['product_id'] = df_fake_pd['product_id'] + 106" ] }, { "cell_type": "code", "execution_count": 199, "id": "8915489e-fab3-4582-9c4f-2d9e5d5037f8", "metadata": {}, "outputs": [], "source": [ "df_pd_full = pd.concat([df_pd, df_fake_pd], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "9d5c54e3-a9fb-46ce-8614-b33093075196", "metadata": {}, "source": [ "### purchase history" ] }, { "cell_type": "code", "execution_count": 246, "id": "3e8adc2c-12f4-4de1-a329-e6e6500577ae", "metadata": {}, "outputs": [], "source": [ "len_fake_data_ph = 994\n", "df_fake_ph = pd.DataFrame()\n", "df_fake_ph = df_fake_ph.assign(customer_id = pd.Series(fake.random.randint(1, 100) for i in range(len_fake_data_ph)),\n", " product_id = 
pd.Series(fake.random.randint(101, 125) for i in range(len_fake_data_ph)),\n", " purchase_date = pd.Series(fake.date_between_dates(pd.to_datetime('2023-01-01'), pd.to_datetime('2023-01-06')) for i in range(len_fake_data_ph)))" ] }, { "cell_type": "code", "execution_count": 247, "id": "5e6089c9-16c9-4f72-b722-fb9f96f55af4", "metadata": {}, "outputs": [], "source": [ "df_ph_full = pd.concat([df_ph, df_fake_ph], ignore_index=True)" ] }, { "cell_type": "markdown", "id": "632e3b2b-f489-43f4-8b08-4853745bd259", "metadata": {}, "source": [ "### final data" ] }, { "cell_type": "code", "execution_count": 248, "id": "fb959b97-8ddf-4b99-a985-65096de6e019", "metadata": {}, "outputs": [], "source": [ "df_ph_full = pd.merge(df_ph_full, df_ci_full, on=['customer_id']).drop(columns=['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'])\n", "df_ph_final = pd.merge(df_ph_full, df_pd_full, on=['product_id']).drop(columns=['Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'])" ] }, { "cell_type": "code", "execution_count": 249, "id": "be4f2ee4-4dfa-4feb-959c-700861e4725d", "metadata": {}, "outputs": [], "source": [ "df_ph_final['purchase_date'] = pd.to_datetime(df_ph_final['purchase_date'])" ] }, { "cell_type": "code", "execution_count": 250, "id": "3e2980ea-13c3-45db-a862-2b6b0dd2793e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customer_id product_id purchase_date page_views time_spent category \\\n", "0 1 101 2023-01-01 25 120 Electronics \n", "1 3 101 2023-01-01 30 150 Electronics \n", "2 3 101 2023-01-03 30 150 Electronics \n", "3 4 101 2023-01-02 15 80 Electronics \n", "4 5 101 2023-01-05 22 110 Electronics \n", "\n", " price ratings \n", "0 500 4.5 \n", "1 500 4.5 \n", "2 500 4.5 \n", "3 500 4.5 \n", "4 500 4.5 " ] }, "execution_count": 250, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final.head()" ] }, { "cell_type": "markdown", "id": "f7a3bd03-0cc0-4436-9c38-83f773955fa8", "metadata": {}, "source": [ "- This is the final dataset that I will use to build the model" ] }, { "cell_type": "code", "execution_count": 251, "id": "ead8f76e-0c6b-48d1-a1d4-8516da1d0bb4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 8)" ] }, "execution_count": 251, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final.shape" ] }, { "cell_type": "markdown", "id": "9bd56d0f-8cc3-4019-b538-b53b6b595b6a", "metadata": {}, "source": [ "- The dataset has 1000 rows and 8 columns" ] }, { "cell_type": "code", "execution_count": 252, "id": "57449465-a0e0-4cff-9bd1-ee0f4de35087", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customer_id product_id purchase_date page_views \\\n", "count 1000.000000 1000.00000 1000 1000.000000 \n", "mean 49.077000 112.73200 2023-01-02 22:45:07.200000 19.996000 \n", "min 1.000000 101.00000 2023-01-01 00:00:00 2.000000 \n", "25% 24.000000 106.00000 2023-01-02 00:00:00 11.000000 \n", "50% 49.000000 113.00000 2023-01-03 00:00:00 22.000000 \n", "75% 74.000000 119.00000 2023-01-04 00:00:00 28.000000 \n", "max 100.000000 125.00000 2023-01-05 00:00:00 35.000000 \n", "std 29.202463 7.37296 NaN 9.906469 \n", "\n", " time_spent price ratings \n", "count 1000.000000 1000.000000 1000.00000 \n", "mean 163.470000 469.828000 3.03613 \n", "min 50.000000 30.000000 1.06000 \n", "25% 98.000000 146.000000 1.48000 \n", "50% 158.000000 480.000000 3.23500 \n", "75% 226.000000 774.000000 4.20000 \n", "max 298.000000 962.000000 4.92000 \n", "std 73.424575 296.960088 1.36245 " ] }, "execution_count": 252, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final.describe()" ] }, { "cell_type": "code", "execution_count": 253, "id": "04bb7bda-847d-41db-8b2b-490d62401368", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "customer_id\n", "3 21\n", "49 17\n", "70 16\n", "38 15\n", "16 15\n", "82 15\n", "84 14\n", "23 14\n", "18 14\n", "63 14\n", "Name: count, dtype: int64" ] }, "execution_count": 253, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final['customer_id'].value_counts().head(10)" ] }, { "cell_type": "code", "execution_count": 254, "id": "071c9a8b-dd76-46f7-9ace-4af99b5bdc51", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "product_id\n", "103 50\n", "105 49\n", "102 48\n", "121 46\n", "101 44\n", "118 43\n", "108 43\n", "117 42\n", "115 42\n", "107 41\n", "Name: count, dtype: int64" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final['product_id'].value_counts().head(10)" ] }, { "cell_type": "code", "execution_count": 255, "id": 
"7f278206-cc47-4760-bef2-2f510df2ef4b", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.countplot(df_ph_final['category']);" ] }, { "cell_type": "code", "execution_count": 256, "id": "c34e8603-d749-4a84-8d68-8712d0824a3f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 256, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.histplot(df_ph_final['time_spent'], bins=20)" ] }, { "cell_type": "code", "execution_count": 257, "id": "82aa5fbf-4ca3-44e7-8e95-672a361652eb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.histplot(df_ph_final['page_views'], bins=20)" ] }, { "cell_type": "markdown", "id": "6cfe0417-c13e-4a7d-889d-39e9bc50e75c", "metadata": {}, "source": [ "- distribution of category, time spent and page views variables" ] }, { "cell_type": "code", "execution_count": 258, "id": "ae2bcbc4-853b-4098-af03-27a171ea1da2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " price page_views time_spent \n", " sum mean sum mean sum mean\n", "customer_id \n", "1 6756 519.692308 325 25.0 1560 120.0\n", "2 4098 372.545455 220 20.0 990 90.0\n", "3 11587 551.761905 630 30.0 3150 150.0\n", "4 4542 324.428571 210 15.0 1120 80.0\n", "5 5715 519.545455 242 22.0 1210 110.0\n", "... ... ... ... ... ... ...\n", "96 5888 490.666667 144 12.0 3360 280.0\n", "97 988 197.600000 125 25.0 680 136.0\n", "98 4879 443.545455 242 22.0 572 52.0\n", "99 3547 394.111111 117 13.0 1629 181.0\n", "100 5552 504.727273 319 29.0 1133 103.0\n", "\n", "[100 rows x 6 columns]" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final.groupby(['customer_id']).agg({\n", " 'price' : ['sum', 'mean'],\n", " 'page_views' : ['sum', 'mean'],\n", " 'time_spent': ['sum', 'mean']\n", "})" ] }, { "cell_type": "code", "execution_count": 259, "id": "0e0db145-c718-4d9d-8034-5519f1756c1b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "purchase_date\n", "2023-01-02 220\n", "2023-01-03 208\n", "2023-01-01 199\n", "2023-01-05 193\n", "2023-01-04 180\n", "Name: count, dtype: int64" ] }, "execution_count": 259, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final['purchase_date'].value_counts()" ] }, { "cell_type": "markdown", "id": "db2d93bb-ae4c-47a9-82ad-d43bf81f85ed", "metadata": {}, "source": [ "## Data Preprocessing" ] }, { "cell_type": "code", "execution_count": 260, "id": "7b4c2115-bcfd-4b3f-919c-cc3a95471326", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customer_id product_id purchase_date page_views time_spent \\\n", "0 1 101 2023-01-01 25 120 \n", "1 3 101 2023-01-01 30 150 \n", "2 3 101 2023-01-03 30 150 \n", "3 4 101 2023-01-02 15 80 \n", "4 5 101 2023-01-05 22 110 \n", ".. ... ... ... ... ... \n", "995 21 106 2023-01-05 26 100 \n", "996 17 106 2023-01-01 16 280 \n", "997 43 106 2023-01-05 6 217 \n", "998 91 106 2023-01-01 29 234 \n", "999 90 106 2023-01-04 11 213 \n", "\n", " category price ratings \n", "0 Electronics 500 4.50 \n", "1 Electronics 500 4.50 \n", "2 Electronics 500 4.50 \n", "3 Electronics 500 4.50 \n", "4 Electronics 500 4.50 \n", ".. ... ... ... \n", "995 Home & Kitchen 678 2.74 \n", "996 Home & Kitchen 678 2.74 \n", "997 Home & Kitchen 678 2.74 \n", "998 Home & Kitchen 678 2.74 \n", "999 Home & Kitchen 678 2.74 \n", "\n", "[1000 rows x 8 columns]" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final" ] }, { "cell_type": "markdown", "id": "ba9ccc12-53fd-465b-8f29-e4bd9d3be185", "metadata": {}, "source": [ "- The data looks clean, so no further preprocessing is needed" ] }, { "cell_type": "markdown", "id": "c446b5de-46b1-4ccf-91ce-417fe350b48e", "metadata": {}, "source": [ "## Model Development\n", "- The model is `KNNWithMeans` from scikit-surprise\n", "- It finds each customer's nearest neighbors and predicts ratings, which are used as a proxy for user preference (here `ratings` is the product-level rating from the product details)\n", "- The goal is to predict, for each customer, the top 5 products with the highest predicted ratings\n", "- My hypothesis is that a higher predicted rating means a higher probability that the customer will buy the product" ] }, { "cell_type": "code", "execution_count": 373, "id": "cfbffc6d-7c44-4449-9d95-cad2db7104a4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " customer_id product_id purchase_date page_views time_spent category \\\n", "0 1 101 2023-01-01 25 120 Electronics \n", "1 3 101 2023-01-01 30 150 Electronics \n", "2 3 101 2023-01-03 30 150 Electronics \n", "3 4 101 2023-01-02 15 80 Electronics \n", "4 5 101 2023-01-05 22 110 Electronics \n", "\n", " price ratings \n", "0 500 4.5 \n", "1 500 4.5 \n", "2 500 4.5 \n", "3 500 4.5 \n", "4 500 4.5 " ] }, "execution_count": 373, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final.head()" ] }, { "cell_type": "code", "execution_count": 414, "id": "8734beed-20df-49e9-ac21-754adfc1639c", "metadata": {}, "outputs": [], "source": [ "reader = Reader(rating_scale=(1, 5))\n", "\n", "# Load the pandas DataFrame into a Surprise dataset (user, item, rating)\n", "data = Dataset.load_from_df(df_ph_final[[\"customer_id\", \"product_id\", \"ratings\"]], reader)" ] }, { "cell_type": "markdown", "id": "3727fb26-8d6d-4438-a9e9-eff23df5220c", "metadata": {}, "source": [ "- Build the Surprise dataset from the `customer_id`, `product_id`, and `ratings` columns" ] }, { "cell_type": "code", "execution_count": 415, "id": "6c266fed-6d16-4633-b002-ca1f13bda70a", "metadata": {}, "outputs": [], "source": [ "sim_options = {\n", " \"name\": \"cosine\",\n", " \"user_based\": True,\n", "}\n", "algo = KNNWithMeans(sim_options=sim_options)" ] }, { "cell_type": "markdown", "id": "d3d689ac-1d96-4a7c-8889-9592c6ef1a13", "metadata": {}, "source": [ "- Define the model: user-based `KNNWithMeans` with cosine similarity" ] }, { "cell_type": "code", "execution_count": 416, "id": "f1f15533-473d-4571-80e6-f54a51ccd85c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Computing the cosine similarity matrix...\n", "Done computing similarity matrix.\n" ] } ], "source": [ "trainset, testset = train_test_split(data, test_size=0.25)\n", "algo.fit(trainset)\n", "predictions = algo.test(testset)" ] }, { "cell_type": "markdown", "id": "61942c65-fefc-4cb2-9532-714058513884", "metadata": {}, "source": [ "- Split the data (75% train, 25% test), train the model, and predict on the test set" ] }, { "cell_type": "code", "execution_count": 417, "id": 
"a7b4cf3e-d850-4dc8-b5bf-7841586b0816", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE: 0.4993\n" ] }, { "data": { "text/plain": [ "0.4992596085741228" ] }, "execution_count": 417, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rmse(predictions)" ] }, { "cell_type": "markdown", "id": "65e3d046-f24d-4adb-8a91-8d90c18e4b94", "metadata": {}, "source": [ "- RMSE is used to evaluate the model; on the 25% test split it is about 0.4993 (ratings range from 1 to 5), which seems good" ] }, { "cell_type": "code", "execution_count": 418, "id": "80bd5e5e-1945-4f81-8fe5-9d2d65c189c2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['KNN_Model.joblib']" ] }, "execution_count": 418, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joblib.dump(algo, \"KNN_Model.joblib\")" ] }, { "cell_type": "markdown", "id": "3fbf727c-ebd9-4476-afbc-f5a7c5a14aaa", "metadata": {}, "source": [ "- Save the trained model" ] }, { "cell_type": "markdown", "id": "8b5c8a49-fce3-42a3-a342-7d3ef3df3954", "metadata": {}, "source": [ "### testing" ] }, { "cell_type": "code", "execution_count": 420, "id": "b390e5a6-01ae-46c4-8df0-493f75fe2b02", "metadata": {}, "outputs": [], "source": [ "list_predicted = []\n", "\n", "# Predict customer 1's rating for every product and keep (product_id, estimated rating)\n", "for pid in df_ph_final['product_id'].unique():\n", "    pred = algo.predict(1, pid)\n", "    list_predicted.append((pred.iid, pred.est))" ] }, { "cell_type": "code", "execution_count": 425, "id": "423ab39c-9232-4ca5-a395-1d5366e6096b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([101, 105, 120, 110, 102, 115, 103, 103, 122, 114, 114, 124, 121])" ] }, "execution_count": 425, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ph_final[df_ph_final['customer_id']==1]['product_id'].values" ] }, { "cell_type": "code", "execution_count": 430, "id": "c3e97249-37b0-4f43-9849-aa30df423169", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[105, 121, 118, 114, 
101]" ] }, "execution_count": 430, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_5_products = sorted(list_predicted, key=lambda x:x[1], reverse=True)[:5]\n", "top_5_products = [product[0] for product in top_5_products]\n", "top_5_products" ] }, { "cell_type": "markdown", "id": "0159ed80-6adc-4238-af18-fc1701da965c", "metadata": {}, "source": [ "- Based on the historical data, the model predicts that the products customer 1 is most likely to buy are 105, 121, 118, 114, and 101" ] }, { "cell_type": "markdown", "id": "de0860a0-47ed-432b-ba09-26bd41e85d7c", "metadata": {}, "source": [ "## Web Application Development\n", "- The app is built with Gradio and hosted on Hugging Face" ] }, { "cell_type": "code", "execution_count": 447, "id": "f73f026e-8692-48b6-af28-411439bac649", "metadata": {}, "outputs": [], "source": [ "df_ph_final.to_csv(\"data_final.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": 437, "id": "f9db4891-7e9c-406f-9487-faf2ccd8c1b8", "metadata": {}, "outputs": [], "source": [ "def product_recommender(customer_id):\n", " list_predicted = []\n", " \n", " for id in df_ph_final['product_id'].unique():\n", " preds = list(algo.predict(customer_id, id))\n", " product_id = preds[1]\n", " product_score = preds[3]\n", " \n", " list_predicted.append((product_id, product_score))\n", " \n", " top_5_products_raw = sorted(list_predicted, key=lambda x:x[1], reverse=True)[:5]\n", " top_5_products = [product[0] for product in top_5_products_raw]\n", "\n", " product_1_category = df_ph_final[df_ph_final['product_id']==top_5_products[0]]['category'].values[0]\n", " product_2_category = df_ph_final[df_ph_final['product_id']==top_5_products[1]]['category'].values[0]\n", " product_3_category = df_ph_final[df_ph_final['product_id']==top_5_products[2]]['category'].values[0]\n", " product_4_category = df_ph_final[df_ph_final['product_id']==top_5_products[3]]['category'].values[0]\n", " product_5_category = 
df_ph_final[df_ph_final['product_id']==top_5_products[4]]['category'].values[0]\n", "\n", " result_1 = f\"Recommendation Product ID {top_5_products[0]} with Category {product_1_category}\"\n", " result_2 = f\"Recommendation Product ID {top_5_products[1]} with Category {product_2_category}\"\n", " result_3 = f\"Recommendation Product ID {top_5_products[2]} with Category {product_3_category}\"\n", " result_4 = f\"Recommendation Product ID {top_5_products[3]} with Category {product_4_category}\"\n", " result_5 = f\"Recommendation Product ID {top_5_products[4]} with Category {product_5_category}\"\n", "\n", " return result_1, result_2, result_3, result_4, result_5" ] }, { "cell_type": "code", "execution_count": 442, "id": "1ea72b31-8ca8-46f1-9833-d33cb4cc8ae7", "metadata": {}, "outputs": [], "source": [ "demo = gr.Interface(\n", " title=\"Product Recommendation System\",\n", " description=\"\"\"This user interface is powered by machine learning to\n", " predict the top 5 products that a customer is likely to buy in their next purchase.\n", " All you need to do is input a Customer ID, and the recommendations will appear.\"\"\",\n", " fn=product_recommender,\n", " inputs=[\n", " gr.Number(label=\"Input Customer ID\")\n", " ],\n", " outputs=[\n", " gr.Textbox(label=\"Recommendation Product 1\"),\n", " gr.Textbox(label=\"Recommendation Product 2\"),\n", " gr.Textbox(label=\"Recommendation Product 3\"),\n", " gr.Textbox(label=\"Recommendation Product 4\"),\n", " gr.Textbox(label=\"Recommendation Product 5\")\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 443, "id": "3d5921b0-816f-4d59-acc2-63e0be081f9f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running on local URL: http://127.0.0.1:7862\n", "\n", "To create a public link, set `share=True` in `launch()`.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 443, "metadata": {}, "output_type": "execute_result" } ], "source": [ "demo.launch()" ] }, { "cell_type": "markdown", "id": "192d7c24-be73-48a7-8bc0-eb1bc49d3e54", "metadata": {}, "source": [ "## Scripts for Deployment on Hugging Face" ] }, { "cell_type": "code", "execution_count": 445, "id": "3d4df15e-88f4-4854-bc2c-8267211cd6eb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing requirements.txt\n" ] } ], "source": [ "%%writefile requirements.txt\n", "gradio\n", "pandas\n", "numpy\n", "faker\n", "scikit-surprise" ] }, { "cell_type": "code", "execution_count": 449, "id": "408aac2a-9b5c-4797-bd00-7706007f81c1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing app.py\n" ] } ], "source": [ "%%writefile app.py\n", "import gradio as gr\n", "import pandas as pd\n", "import joblib\n", "\n", "data = pd.read_csv(r\"data_final.csv\")\n", "\n", "# Load the trained model saved earlier with joblib.dump; without this, `algo` is undefined in app.py\n", "algo = joblib.load(\"KNN_Model.joblib\")\n", "\n", "def product_recommender(customer_id):\n", " list_predicted = []\n", " \n", " for id in data['product_id'].unique():\n", " preds = list(algo.predict(customer_id, id))\n", " product_id = preds[1]\n", " product_score = preds[3]\n", " \n", " list_predicted.append((product_id, product_score))\n", " \n", " top_5_products_raw = sorted(list_predicted, key=lambda x:x[1], reverse=True)[:5]\n", " top_5_products = [product[0] for product in top_5_products_raw]\n", "\n", " product_1_category = data[data['product_id']==top_5_products[0]]['category'].values[0]\n", " product_2_category = data[data['product_id']==top_5_products[1]]['category'].values[0]\n", " product_3_category = data[data['product_id']==top_5_products[2]]['category'].values[0]\n", " product_4_category = data[data['product_id']==top_5_products[3]]['category'].values[0]\n", " product_5_category = data[data['product_id']==top_5_products[4]]['category'].values[0]\n", "\n", " 
result_1 = f\"Recommendation Product ID {top_5_products[0]} with Category {product_1_category}\"\n", " result_2 = f\"Recommendation Product ID {top_5_products[1]} with Category {product_2_category}\"\n", " result_3 = f\"Recommendation Product ID {top_5_products[2]} with Category {product_3_category}\"\n", " result_4 = f\"Recommendation Product ID {top_5_products[3]} with Category {product_4_category}\"\n", " result_5 = f\"Recommendation Product ID {top_5_products[4]} with Category {product_5_category}\"\n", "\n", " return result_1, result_2, result_3, result_4, result_5\n", "\n", "demo = gr.Interface(\n", " title=\"Product Recommendation System\",\n", " description=\"\"\"This user interface is powered by machine learning to\n", " predict the top 5 products that a customer is likely to buy in their next purchase.\n", " All you need to do is input a Customer ID, and the recommendations will appear.\"\"\",\n", " fn=product_recommender,\n", " inputs=[\n", " gr.Number(label=\"Input Customer ID\")\n", " ],\n", " outputs=[\n", " gr.Textbox(label=\"Recommendation Product 1\"),\n", " gr.Textbox(label=\"Recommendation Product 2\"),\n", " gr.Textbox(label=\"Recommendation Product 3\"),\n", " gr.Textbox(label=\"Recommendation Product 4\"),\n", " gr.Textbox(label=\"Recommendation Product 5\")\n", " ]\n", ")\n", "\n", "if __name__ == \"__main__\":\n", " demo.launch()" ] }, { "cell_type": "markdown", "id": "50dee250-e65d-4ae3-9263-cfacc09ac6ff", "metadata": {}, "source": [ "- Web app link: https://huggingface.co/spaces/Adipta/product-recommender\n", "- Repository link: https://huggingface.co/spaces/Adipta/product-recommender/tree/main" ] }, { "cell_type": "code", "execution_count": null, "id": "61ac0db1-1c8e-447f-8cb2-70f679d61a59", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, 
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 5 }