Spaces: LunaticMaestro/book-recommender

Deepak Sahu committed on Nov 23, 2024

Commit 01e2b4e (1 parent: dc7bbeb)

section update

Files changed (7)

.resources/clean_3.png +3 -0
.resources/fine-tune.png +3 -0
.resources/generate_emb.png +3 -0
.resources/generate_emb2.png +3 -0
README.md +12 -5
z_clean_data.ipynb +286 -0
z_finetune_gpt.py +1 -1

.resources/clean_3.png ADDED

Git LFS Details

SHA256: 0827e6d97d8d1fc19cc12166b6ac9e89fbc17dc4697e26d28400874b363d3ab9
Pointer size: 130 Bytes
Size of remote file: 16.1 kB

.resources/fine-tune.png ADDED

Git LFS Details

SHA256: 1491121a40ea49a9a2cdca7eb93f26e29a7387e915423b777b8ec79ddbab352e
Pointer size: 130 Bytes
Size of remote file: 66 kB

.resources/generate_emb.png ADDED

Git LFS Details

SHA256: c5ad1b606bb8964a9a41613b263d2b13daf38323e908cce5e058fae8fe0f56af
Pointer size: 130 Bytes
Size of remote file: 43.2 kB

.resources/generate_emb2.png ADDED

Git LFS Details

SHA256: 318d88df5b8a06387bb9c16992ac90af9093771df35601543ccebc7c7d46454a
Pointer size: 130 Bytes
Size of remote file: 49.5 kB

README.md CHANGED

@@ -115,7 +115,7 @@ What is not taken care
 python z_clean_data.py
 ```
-![image](https://github.com/user-attachments/assets/a466c20b-60ed-47ac-8bfc-e0a38ccdb88d)
 Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`
@@ -123,11 +123,17 @@ Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`
 ### Step 2: Generate vectors of the books summaries.
-Here, I am going to use pretrained sentence encoder that will help get the meaning of the sentence. As the semantic meaning of the summaries themselves are not changed.
-We perform this over `unique_titles_books_summary.csv` dataset
-![image](https://github.com/user-attachments/assets/21d2d92b-0ad5-4686-8e38-c47df10893f8)
 Use command
 ```SH
@@ -136,7 +142,8 @@ python z_embedding.py
 Just using CPU should take <1 min
-![image](https://github.com/user-attachments/assets/5765d586-cc50-4adf-b714-5e371f757f38)
 Output: `app_cache/summary_vectors.npy`

 python z_clean_data.py
 ```
+![image](.resources/clean_3.png)
 Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`
 ### Step 2: Generate vectors of the books summaries.
+**WHAT & WHY**
+Here, I use a pretrained sentence encoder to capture the meaning of each summary. We perform this over the `unique_titles_books_summary.csv` dataset.
+The vectors are cached because the semantic meaning of the summaries (for the books to output) does not change during the entire runtime.
+![image](.resources/generate_emb.png)
+**RUN**:
 Use command
 ```SH
 Just using CPU should take <1 min
+![image](.resources/generate_emb2.png)
 Output: `app_cache/summary_vectors.npy`
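
The caching step described in the README change can be sketched as follows. This is a minimal illustration, not the contents of `z_embedding.py`: the helper `load_or_build_vectors` and the stand-in `toy_encode` are hypothetical names, and in the real script the `encode` callable would presumably be the pretrained sentence encoder, which this diff does not name.

```python
from pathlib import Path
from typing import Callable, Sequence

import numpy as np


def load_or_build_vectors(
    summaries: Sequence[str],
    cache_path: Path,
    encode: Callable[[Sequence[str]], np.ndarray],
) -> np.ndarray:
    """Return one vector per summary, reusing the .npy cache when it exists."""
    if cache_path.exists():
        # Summaries do not change at runtime, so the stored vectors stay valid.
        return np.load(cache_path)
    vectors = np.asarray(encode(summaries), dtype=np.float32)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_path, vectors)
    return vectors


def toy_encode(texts: Sequence[str]) -> np.ndarray:
    """Hypothetical stand-in encoder: [character count, word count] per text."""
    return np.array([[len(t), len(t.split())] for t in texts], dtype=np.float32)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "app_cache" / "summary_vectors.npy"
        first = load_or_build_vectors(["a short summary"], cache, toy_encode)
        second = load_or_build_vectors(["a short summary"], cache, toy_encode)
        print(first.shape, np.array_equal(first, second))
```

This sketch never checks whether the cache is stale, so the cached file would need to be deleted whenever `unique_titles_books_summary.csv` is regenerated.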

z_clean_data.ipynb ADDED

	@@ -0,0 +1,286 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Just Inspection Notebook\n",
+    "\n",
+    "Different from `z_clean_data.py`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>book_name</th>\n",
+       "      <th>summaries</th>\n",
+       "      <th>categories</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>The Highly Sensitive Person</td>\n",
+       "      <td>is a self-assessment guide and how-to-live tem...</td>\n",
+       "      <td>science</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Why Has Nobody Told Me This Before?</td>\n",
+       "      <td>is a collection of a clinical psychologist’s ...</td>\n",
+       "      <td>science</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>science</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Brave New World</td>\n",
+       "      <td>presents a futuristic society engineered perf...</td>\n",
+       "      <td>science</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>1984</td>\n",
+       "      <td>is the story of a man questioning the system ...</td>\n",
+       "      <td>science</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                             book_name  \\\n",
+       "0          The Highly Sensitive Person   \n",
+       "1  Why Has Nobody Told Me This Before?   \n",
+       "2                 The Midnight Library   \n",
+       "3                      Brave New World   \n",
+       "4                                 1984   \n",
+       "\n",
+       "                                           summaries categories  \n",
+       "0  is a self-assessment guide and how-to-live tem...    science  \n",
+       "1   is a collection of a clinical psychologist’s ...    science  \n",
+       "2   tells the story of Nora, a depressed woman in...    science  \n",
+       "3   presents a futuristic society engineered perf...    science  \n",
+       "4   is the story of a man questioning the system ...    science  "
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from z_utils import get_dataframe \n",
+    "\n",
+    "books_df = get_dataframe(\"books_summary.csv\")\n",
+    "books_df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>book_name</th>\n",
+       "      <th>summaries</th>\n",
+       "      <th>categories</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>science</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>522</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>relationships</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>788</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>happiness</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1821</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>psychology</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2402</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>motivation</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3645</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>creativity</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3941</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>fiction</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4305</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>work</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4665</th>\n",
+       "      <td>The Midnight Library</td>\n",
+       "      <td>tells the story of Nora, a depressed woman in...</td>\n",
+       "      <td>mindfulness</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                 book_name                                          summaries  \\\n",
+       "2     The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "522   The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "788   The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "1821  The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "2402  The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "3645  The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "3941  The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "4305  The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "4665  The Midnight Library   tells the story of Nora, a depressed woman in...   \n",
+       "\n",
+       "         categories  \n",
+       "2           science  \n",
+       "522   relationships  \n",
+       "788       happiness  \n",
+       "1821     psychology  \n",
+       "2402     motivation  \n",
+       "3645     creativity  \n",
+       "3941        fiction  \n",
+       "4305           work  \n",
+       "4665    mindfulness  "
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "books_df[books_df[\"book_name\"] == \"The Midnight Library\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "count    1230.000000\n",
+       "mean        4.042276\n",
+       "std         1.985669\n",
+       "min         1.000000\n",
+       "25%         3.000000\n",
+       "50%         4.000000\n",
+       "75%         5.000000\n",
+       "max        12.000000\n",
+       "Name: book_name, dtype: float64"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "books_df[\"book_name\"].value_counts().describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
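
The notebook output shows the same title repeated once per category (nine rows for "The Midnight Library"). A dependency-free sketch of that inspection, and of one way to collapse the rows to unique titles, is given here. The row values are copied from the notebook output; the variable names are illustrative, and how `z_clean_data.py` actually builds `unique_titles_books_summary.csv` is not shown in this diff.

```python
from collections import Counter

# (book_name, category) pairs copied from the notebook output above.
rows = [
    ("The Midnight Library", "science"),
    ("Brave New World", "science"),
    ("1984", "science"),
    ("The Midnight Library", "relationships"),
    ("The Midnight Library", "happiness"),
]

# Rows per title, the same idea as books_df["book_name"].value_counts().
rows_per_title = Counter(name for name, _ in rows)

# Keep only the first row seen for each title (dicts preserve insertion order).
unique_titles = {}
for name, category in rows:
    unique_titles.setdefault(name, category)

# Alternative view: keep every category a title was listed under.
categories_by_title = {}
for name, category in rows:
    categories_by_title.setdefault(name, []).append(category)

print(rows_per_title["The Midnight Library"])  # 3
print(list(unique_titles))  # ['The Midnight Library', 'Brave New World', '1984']
```

Keeping only the first row drops the extra category labels, while the grouped view preserves them; which one suits the recommender depends on whether categories are used downstream.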

z_finetune_gpt.py CHANGED

@@ -13,7 +13,7 @@ BASE_CASUAL_MODEL = "openai-community/gpt2"
 TRAINED_MODEL_OUTPUT_DIR = "content" # same name for HF Hub
 set_seed(42)
-EPOCHS = 1
 LR = 2e-5
 # Load dataset

 TRAINED_MODEL_OUTPUT_DIR = "content" # same name for HF Hub
 set_seed(42)
+EPOCHS = 2
 LR = 2e-5
 # Load dataset
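
The effect of raising `EPOCHS` from 1 to 2 on training length can be estimated with simple arithmetic. The sketch below is an assumption-laden illustration: the dataset size and batch size are hypothetical placeholders (neither appears in this diff), and the function ignores gradient accumulation.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer updates when every batch triggers one update."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs


# Hypothetical values, chosen only to make the arithmetic concrete.
NUM_EXAMPLES = 1000
BATCH_SIZE = 8

print(total_optimizer_steps(NUM_EXAMPLES, BATCH_SIZE, epochs=1))  # 125
print(total_optimizer_steps(NUM_EXAMPLES, BATCH_SIZE, epochs=2))  # 250
```

With everything else fixed, doubling the epochs doubles the number of updates and, roughly, the training time.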