Update README.md
README.md
CHANGED
@@ -10,4 +10,52 @@ pinned: false
short_description: 'r/technology, r/gaming, r/programming etc search comments '
---
# Reddit Semantic Search (Prototype)

A lightweight semantic search engine built on Reddit comments using:
- **Word2Vec embeddings** (trained from scratch on selected subreddits)
- **FAISS** for fast vector indexing and retrieval
- **Gradio** for a user-friendly, Reddit-themed interface

> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.

---

## Dataset

- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
- Subreddits used:
  - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.
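
The chunking step can be sketched in plain Python. This is a minimal stand-in, not the project's PySpark job: the function name `chunk_comments`, the space used to join comments, and the sample data are all assumptions made for illustration. Only the group size of 5 comes from this README. Because the function consumes any iterable lazily, it would also work on a streamed split without loading it fully into memory.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunk_comments(comments: Iterable[str], size: int = 5) -> Iterator[str]:
    """Yield one text chunk per `size` consecutive comments.

    Never holds more than `size` comments in memory at once, so the
    input can be a generator over a streamed dataset.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(comments)
    while True:
        group = list(islice(iterator, size))
        if not group:
            return
        # The final group may hold fewer than `size` comments.
        # Joining with a single space is an assumption of this sketch.
        yield " ".join(group)


# Hypothetical sample data standing in for one streamed subreddit split.
sample = [f"comment {i}" for i in range(12)]
chunks = list(chunk_comments(sample))
print(len(chunks))
```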

---

## Project Pipeline

1. **Data Loading & Chunking**
   - Load subreddit splits individually using streaming
   - Group every 5 comments into a single text chunk using PySpark
   - Clean and tokenize text for training

2. **Training Word2Vec**
   - Custom embeddings trained using `gensim`'s Word2Vec on cleaned comment chunks

3. **Vector Indexing (FAISS)**
   - Each chunk is embedded by averaging the Word2Vec vectors of its words
   - Dense vectors are indexed using `faiss.IndexFlatL2`

4. **Semantic Search App (Gradio)**
   - Enter your query and select a subreddit filter
   - Retrieves the top 5 semantically similar comment chunks
   - Reranking is not built in yet and can be added later
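
Steps 3 and 4 can be illustrated without `gensim` or FAISS installed. The sketch below averages word vectors into a chunk embedding and then runs an exhaustive nearest-neighbor search on squared L2 distance, which is the kind of brute-force comparison a flat L2 index performs. Everything named here is hypothetical: the `tokenize` regex, the `embed` and `search` helpers, the two-dimensional toy vectors, and the choice to apply the subreddit filter before scoring. The project's real cleaning rules and filtering order are not stated in this README.

```python
import re
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (an assumed cleaning rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str, vectors: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of in-vocabulary tokens; zero vector if none match."""
    known = [vectors[t] for t in tokenize(text) if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def search(
    query: str,
    chunks: List[Tuple[str, str]],  # (subreddit, chunk text) pairs
    vectors: Dict[str, Vector],
    dim: int,
    subreddit: Optional[str] = None,
    k: int = 5,
) -> List[Tuple[float, str]]:
    """Return up to k (squared L2 distance, chunk text) pairs, nearest first."""
    q = embed(query, vectors, dim)
    scored = []
    for sub, text in chunks:
        # Assumed behavior: drop non-matching subreddits before scoring.
        if subreddit is not None and sub != subreddit:
            continue
        c = embed(text, vectors, dim)
        scored.append((sum((a - b) ** 2 for a, b in zip(q, c)), text))
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Toy vectors standing in for a trained Word2Vec vocabulary.
toy_vectors = {
    "python": [1.0, 0.0],
    "code": [0.9, 0.1],
    "console": [0.0, 1.0],
    "game": [0.1, 0.9],
}
toy_chunks = [
    ("programming", "Python code review tips"),
    ("gaming", "Best console game this year"),
    ("programming", "My game engine code"),
]
results = search("python code", toy_chunks, toy_vectors, dim=2)
gaming_only = search("python code", toy_chunks, toy_vectors, dim=2, subreddit="gaming")
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```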

---

## Run the App

```bash
pip install -r requirements.txt
python app.py  # or run the notebook
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference