---
title: Book Recommender
emoji: ⚡
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.6.0
app_file: app.py
pinned: false
short_description: A content based book recommender.
---

# Content-Based-Book-Recommender

A HyDE-based approach for building a recommendation engine.

Try it out: https://huggingface.co/spaces/LunaticMaestro/book-recommender

![image](.resources/preview.png)

## Foreword

- All images are my own work; their source PowerPoint is in the `.resources` folder of this repo.
- Code documentation follows [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html).
- All file paths are set as CONSTs at the beginning of each script, to make the paths easier to work with during inference & evaluation; hence they are not passed as CLI arguments.
- The seed value for code reproducibility is set as a CONST as well.
- The `z_` prefix in filenames is just to make it obvious (to a human) which modules are custom and which are prebuilt during import.

## Table of Content

- [Running Inference Locally](#running-inference)
  - [Colab 🏎️ & minimal set up](#google-colab)
- [10,000 feet Approach overview](#approach)
- Pipeline walkthrough in detail

  *Each part of the pipeline has a separate script that needs to be executed; it is mentioned in the respective section along with output screenshots.*

  - [Training](#training-steps)
    - [Step 1: Data Clean](#step-1-data-clean)
    - [Step 2: Generate vectors of the books summaries](#step-2-generate-vectors-of-the-books-summaries)
    - [Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds](#step-3-fine-tune-gpt-2-to-hallucinate-but-with-some-bounds)
  - [Parts of Inference](#parts-of-inference)
    - [How Recommendation is working](#recommendation-generation)
    - [How Similarity Matching is working](#similarity-matching)
  - [Evaluation Metric & Result](#evaluation-metric--result)

## Running Inference

### Memory Requirements

The code needs <2 GB RAM to use both of the following models; a CPU alone works fine for inference.

- https://huggingface.co/openai-community/gpt2 — ~500 MB
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 — <500 MB

### Libraries

`requirements.txt` is set up so that HF Spaces does not run into dependency conflicts. I developed the code in Google Colab, where the following libraries required manual installation:

```SH
pip install sentence-transformers datasets gradio
```

### Running

#### Google Colab

```SH
!pip install sentence-transformers datasets gradio
!git clone https://github.com/LunaticMaestro/Content-Based-Book-Recommender
%cd /content/Content-Based-Book-Recommender
```

```SH
!python app.py
```

![image](.resources/colab_run.png)

**Access the app at the public link.**

Colab is also fast 🏎️: even with CPU only, it takes ~16 s.

![image](.resources/colab_fast.png)

Sidenotes:

1. I rewrote the snippets from `z_evaluate.py` into `app.py`, because Gradio rendering needs to be handled differently.
2. DON'T set `debug=True` for Gradio in an HF Space, otherwise the Space doesn't start.
3. Free HF Spaces handle persisting models (cache files) differently from local runs (tried from Colab), which work faster. **You will see a lot of my commits in the HF Space from discovering this problem.**

#### Local System

```SH
python app.py
```

Access at http://localhost:7860/

## Approach

![image](.resources/approach.png)

References:

- This is the core idea: https://arxiv.org/abs/2212.10496
- Another work based on the same idea: https://github.com/aws-samples/content-based-item-recommender
- For the future, a very complex work: https://github.com/HKUDS/LLMRec

## Training Steps

### Step 1: Data Clean

What is taken care of:

- unwanted column removal (the first column, an index)
- missing value removal (drop rows)
- duplicate row removal

What is not taken care of:

- stopword removal, stemming/lemmatization, or special character removal, **because the approach uses causal language modeling (later steps), so it makes no sense to rip apart the word meanings**

### Observations from `z_cleand_data.ipynb`

- The same title corresponds to different categories.

  ![image](.resources/clean_1.png)

- There are 1230 unique titles in total.

  ![image](.resources/clean_2.png)

**Action**: We are not going to remove the rows that show the same titles (& summaries) with different categories, but rather create a separate file for the unique titles.

**RUN**:

```SH
python z_clean_data.py
```

![image](.resources/clean_3.png)

Output: `clean_books_summary.csv`, `unique_titles_books_summary.csv`

### Step 2: Generate vectors of the books summaries

**WHAT & WHY** Here I use a pretrained sentence encoder that captures the meaning of each summary. We run it over the `unique_titles_books_summary.csv` dataset.

We cache the vectors because the semantic meaning of the summaries (for the books to output) does not change during the entire runtime.

![image](.resources/generate_emb.png)

**RUN**: Use the command

```SH
python z_embedding.py
```

Using just a CPU, this should take <1 min.

![image](.resources/generate_emb2.png)

Output: `app_cache/summary_vectors.npy`
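The gist of this step, as a minimal sketch (the actual logic lives in `z_embedding.py`; the column name `summaries` is my assumption about the CSV schema):

```python
# Hypothetical condensed sketch of z_embedding.py; the column name
# "summaries" is an assumption about the CSV schema.
import os

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("unique_titles_books_summary.csv")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode every unique summary once; the vectors stay valid for the
# whole runtime, so we cache them to disk.
vectors = encoder.encode(df["summaries"].tolist(), show_progress_bar=True)

os.makedirs("app_cache", exist_ok=True)
np.save("app_cache/summary_vectors.npy", vectors)
```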
### Step 3: Fine-tune GPT-2 to Hallucinate but with some bounds

**WHAT & WHY** Hypothetical Document Extraction (HyDE) in a nutshell:

- The **Hypothetical** part of the HyDE approach is all about generating random summaries, in short hallucinating. **This is why the approach will work for new book titles.**
- The **Document Extraction** part of HyDE is about using these hallucinated summaries to do a semantic search on the database.

**Why fine-tune GPT-2?**

1. We want it to hallucinate, but within boundaries, i.e., speak the words/language that we have in `books_summaries.csv`, NOT out-of-this-world logic.
2. Prompt-tune it so that we get consistent results. (The screenshot below, from https://huggingface.co/openai-community/gpt2, shows that the base model is only mildly consistent.)

![image](.resources/fine-tune.png)

References:

- HyDE approach: Precise Zero-Shot Dense Retrieval without Relevance Labels, https://arxiv.org/pdf/2212.10496
- The prompt design and book-summary idea are borrowed from https://github.com/pranavpsv/Genre-Based-Story-Generator
  - I did not use his model; it lacks most of the categories (our dataset is different).
  - His code base is too large; it could be edited, but it is not worth the effort.
- Fine-tuning code instructions are from https://huggingface.co/docs/transformers/en/tasks/language_modeling

**RUN** If you want to:

- Push to HF: you must supply your Hugging Face token, required to push the model to HF.

  ```SH
  huggingface-cli login
  ```

- Not push to HF: then in `z_finetune_gpt.py`:
  - set `push_to_hub` on line 59 to `False`
  - comment out line 77, `trainer.push_to_hub()`

We are going to use the dataset `clean_books_summary.csv` when triggering this training.
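For orientation, here is a condensed, hypothetical sketch of the kind of causal-LM fine-tuning `z_finetune_gpt.py` performs, following the Hugging Face language-modeling tutorial linked above; the prompt template and column names (`book_name`, `summaries`) are assumptions, not the script's exact values:

```python
# Hypothetical sketch of the fine-tuning in z_finetune_gpt.py; the prompt
# template and column names are assumptions about the actual script.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

df = pd.read_csv("clean_books_summary.csv")

def to_prompt(row):
    # Assumed prompt template: condition the generation on the title.
    return f"Book: {row['book_name']}\nSummary: {row['summaries']}{tokenizer.eos_token}"

ds = Dataset.from_dict({"text": df.apply(to_prompt, axis=1).tolist()})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-book-summary-generator",
                           num_train_epochs=10, push_to_hub=False),
    train_dataset=ds,
    # mlm=False -> plain causal-LM objective (cross-entropy on next token)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The `push_to_hub=False` here corresponds to the flag mentioned above; flip it and call `trainer.push_to_hub()` to publish the model.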
Then trigger the training:

```SH
python z_finetune_gpt.py
```

The image below shows only 2 epochs, but the model pushed to my HF at https://huggingface.co/LunaticMaestro/gpt2-book-summary-generator is trained for 10 epochs, which takes ~30 min on a T4 GPU, **reducing the loss to 0.87 (perplexity ≈ 2.38)**.

![image](.resources/fine-tune2.png)

The loss you see is cross-entropy loss; as stated in the [fine-tuning instructions](https://huggingface.co/docs/transformers/en/tasks/language_modeling): `Transformers models all have a default task-relevant loss function, so you don't need to specify one`. So all we care about is that the lower the value, the better the model is trained. :)

We are NOT going to test this model on its own against some test dataset, as the model is already proven (it's GPT-2, duh!!). But **we are going to evaluate our HyDE approach end-to-end next, to ensure the sanity of the approach**, which will inherently prove the goodness of this model.

## Parts of Inference

Before discussing the evaluation metric, let me walk you through the two important pieces: recommendation generation and similarity matching.

### Recommendation Generation

The generation is handled by functions in the script `z_hypothetical_summary.py`. Under the hood, the following happens:

![image](.resources/eval1.png)

**Function Preview**

I did minimal post-processing to chop the `prompt` off the generated summaries before returning the result.

![image](.resources/eval2.png)

### Similarity Matching

![image](.resources/eval3.png)

![image](.resources/eval4.png)

**Function Preview**

Because there are 1230 unique titles, we get an averaged similarity vector of the same size.

![image](.resources/eval5.png)

## Evaluation Metric & Result

For a given input title we can get the rank (by descending cosine similarity) of each stored title. To evaluate the entire approach we use a modified version of **Mean Reciprocal Rank (MRR)**.

![image](.resources/eval6.png)

Test plan:

- Take 30 random samples and compute the mean of their reciprocal ranks.
- If we want our known book titles to appear in the top 5 results, then we need MRR >= 1/5 = 0.2.

**RUN**

```SH
python z_evaluate.py
```

![image](.resources/eval7.png)

The values of TOP_P and TOP_K (i.e., token sampling for our generator model) are set as `CONST`s in `z_evaluate.py`; the current values are borrowed from this work: https://www.kaggle.com/code/tuckerarrants/text-generation-with-huggingface-gpt2#Top-K-and-Top-P-Sampling

MRR = 0.311 implies that there's a good chance that the target book will be at rank 1/0.311 ≈ 3 (third rank), **i.e., within the top 5 recommendations**.

> TODO: A sampling study can be done to better support this conclusion.
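To tie generation, similarity matching, and the metric together, here is a minimal sketch of the end-to-end evaluation; the prompt template, column names, `n_samples`, and the sampling constants are assumptions, not the exact values in `z_hypothetical_summary.py` and `z_evaluate.py`:

```python
# Hypothetical end-to-end sketch; the prompt template, column names, and
# sampling constants are assumptions, not the repo's exact values.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

TOP_K, TOP_P = 50, 0.85  # assumed values of the CONSTs in z_evaluate.py

df = pd.read_csv("unique_titles_books_summary.csv")
summary_vectors = np.load("app_cache/summary_vectors.npy")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
generator = pipeline("text-generation",
                     model="LunaticMaestro/gpt2-book-summary-generator")

def reciprocal_rank(title: str, n_samples: int = 5) -> float:
    # 1. Hallucinate a few hypothetical summaries for the input title.
    prompt = f"Book: {title}\nSummary:"  # assumed prompt template
    outputs = generator(prompt, do_sample=True, top_k=TOP_K, top_p=TOP_P,
                        num_return_sequences=n_samples, max_new_tokens=100)
    # Minimal post-processing: chop the prompt off each generated text.
    fakes = [o["generated_text"][len(prompt):] for o in outputs]

    # 2. Average cosine similarity of the fakes against all 1230 title vectors.
    fake_vecs = encoder.encode(fakes)
    sims = util.cos_sim(fake_vecs, summary_vectors).numpy().mean(axis=0)

    # 3. Rank of the true title (1 = best) by descending similarity.
    order = np.argsort(-sims)
    true_idx = df.index[df["book_name"] == title][0]
    return 1.0 / (int(np.where(order == true_idx)[0][0]) + 1)

# Modified MRR over 30 random titles; MRR >= 0.2 means top-5 on average.
titles = df["book_name"].sample(30, random_state=42)
print("MRR:", np.mean([reciprocal_rank(t) for t in titles]))
```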