---
title: Book Recommender
emoji: ⚡
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.6.0
app_file: app.py
pinned: false
short_description: A content based book recommender.
---
# Content-Based-Book-Recommender
A HyDE-based approach to building a recommendation engine.
## Libraries installed separately
I used Google Colab with the following extra libraries.
NOT storing a `requirements.txt` because of an issue while pushing to the HF Space.
```SH
pip install -U sentence-transformers datasets
```
## Training Steps
**All file paths are set as constants at the beginning of each script, to make the paths easier to reuse during inference; hence they are not passed as CLI arguments.**
### Step 1: Data Cleaning
I am going to do basic steps like removing the unwanted first (index) column, dropping rows with missing values, and removing duplicate rows. Output screenshot attached.
I am NOT doing any text pre-processing steps like stopword removal, stemming/lemmatization, or special-character removal, because my approach uses causal language modelling (later steps); it makes no sense to rip apart word meaning via these word-based techniques.
A little tinkering around with the dataset shows that some titles can belong to multiple categories. (*I ran this code separately; it is not part of any script.*)

A descriptive analysis shows that there are just 1230 unique titles. (*I ran this code separately; it is not part of any script.*)

We are not going to remove rows that share the same title (and summary) but differ in category; rather, we create a separate file for unique titles.
```SH
python z_clean_data.py
```

Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`
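The cleaning steps above can be sketched roughly as follows. This is a sketch, not the actual `z_clean_data.py`: the column names (`title`, `summary`, `category`) and the `Unnamed` index-column pattern are my assumptions.

```python
# Rough sketch of the cleaning step (column names "title", "summary",
# "category" are assumptions, not the real schema of the dataset).
import pandas as pd

def clean(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # drop the unwanted leading index column if present
    df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
    df = df.dropna()               # drop rows with missing values
    df = df.drop_duplicates()      # drop exact duplicate rows
    # keep every (title, category) pair for training, but also build a
    # one-row-per-title frame for the semantic-search index
    unique = df.drop_duplicates(subset=["title"])
    return df, unique
```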
### Step 2: Generate vectors of the book summaries
Here I am going to use a pretrained sentence encoder, which captures the meaning of each summary; the semantic meaning of the summaries themselves is not changed by Step 1.
We perform this over the `unique_titles_books_summary.csv` dataset.
 | |
Use command | |
```SH | |
python z_embedding.py | |
``` | |
Just using CPU should take <1 min | |
 | |
Output: `app_cache/summary_vectors.npy` | |
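The core of this step looks roughly like the sketch below. The encoder checkpoint (`all-MiniLM-L6-v2`) and the function names are my assumptions, not necessarily what `z_embedding.py` actually uses.

```python
# Sketch of the embedding step (checkpoint name is an assumption).
import numpy as np

def encode_summaries(summaries, model_name="all-MiniLM-L6-v2"):
    """Encode summaries into dense vectors with a pretrained sentence
    encoder (requires `sentence-transformers`)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    # normalized vectors let a plain dot product act as cosine similarity
    return model.encode(summaries, normalize_embeddings=True)

def save_vectors(vectors, path="app_cache/summary_vectors.npy"):
    np.save(path, np.asarray(vectors, dtype=np.float32))
```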
### Step 3: Fine-tune GPT-2 to hallucinate, but within bounds
Let's address the **Hypothetical** part of the HyDE approach. It is all about generating random summaries, in short, hallucinating. The **Document Extraction** part of HyDE is about using these hallucinated summaries to do a semantic search on the database.
Two very important reasons to fine-tune GPT-2:
1. We want it to hallucinate, but within boundaries, i.e. speaking the words/language that we have in `books_summaries.csv`, not wildly out-of-domain logic.
2. Prompt-tune it so that we get consistent results. (Screenshot from https://huggingface.co/openai-community/gpt2; the screenshot shows the base model is only mildly consistent.)

> We are going to use the `clean_books_summary.csv` dataset in this training, to align with the prompt ingesting different genres.
References:
- HyDE approach: Precise Zero-Shot Dense Retrieval without Relevance Labels, https://arxiv.org/pdf/2212.10496
- Prompt design and the book-summary idea are borrowed from https://github.com/pranavpsv/Genre-Based-Story-Generator
  - I did not use his model:
    - it lacks most of our categories (our dataset is different);
    - his code base is too large; I could edit it, but it is not worth the effort.
- Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling
Command:
You must supply your Hugging Face token; it is required to push the model to the HF Hub.
```SH
huggingface-cli login
```
We are going to use the dataset `clean_books_summary.csv` while triggering this training.
```SH
python z_finetune_gpt.py
```
(Training lasts ~30 mins for 10 epochs on a T4 GPU.)
 | |
The loss you see is cross-entryopy loss; as ref in the fine-tuning instructions (see above reference) states : `Transformers models all have a default task-relevant loss function, so you don’t need to specify one ` | |
 | |
So all we care is lower the value better is the model trained :) | |
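For reference, the recipe from the fine-tuning guide linked above boils down to something like this sketch. The hyperparameters, output directory, and the `summary` column name are assumptions; `z_finetune_gpt.py` may differ.

```python
def finetune(csv_path="clean_books_summary.csv", epochs=10):
    # Sketch only: follows the HF causal-LM fine-tuning guide linked above.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token          # GPT-2 has no pad token
    ds = load_dataset("csv", data_files=csv_path)["train"]
    ds = ds.map(lambda row: tok(row["summary"], truncation=True),
                batched=True, remove_columns=ds.column_names)

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    args = TrainingArguments(output_dir="gpt2-book-summarizer",
                             num_train_epochs=epochs,
                             push_to_hub=True)  # needs `huggingface-cli login`
    Trainer(model=model, args=args, train_dataset=ds,
            # mlm=False -> causal LM objective (cross-entropy next-token loss)
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
            ).train()
```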
We are NOT going to test this unit model on some test dataset, as the model is already proven (it's GPT-2, duh!!).
But **we are going to evaluate our HyDE approach end-to-end next, to ensure the sanity of the approach**.
## Evaluation
Before discussing the evaluation metric, let me walk you through two important pieces: recommendation generation and similarity matching.
### Recommendation Generation
The generation is handled by the script `z_hypothetical_summary.py`. Under the hood, the following happens:

Code preview. I did minimal post-processing to chop the `prompt` off the generated summaries before returning the result.

### Similarity Matching



Because there are 1230 unique titles, we get an averaged similarity vector of the same size.

### Evaluation Metric
For a given input title, we can get the rank (by descending cosine similarity) of each stored title. To evaluate the entire approach, we are going to use a modified version of the **Mean Reciprocal Rank (MRR)**.

We are going to do this for 30 random samples and compute the mean of their reciprocal ranks. Ideally every title should be ranked 1, making the MRR equal to 1; the closer to 1, the better.

The values of TOP_P and TOP_K (i.e. the token-sampling parameters for our generator model) are set as `CONST`s in `z_evaluate.py`; the current values are borrowed from this work: https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling
MRR = 0.311 implies that, on average, there is a good chance the target book lands around rank 1/0.311 ≈ 3, **i.e. within the top 5 recommendations**.