Spaces: GenAIDevTOProd / Reddit-SemanticSearch-Prototype

GenAIDevTOProd committed 13 days ago

Commit a03beaf (verified) · 1 Parent(s): aaad046

Update README.md

Files changed (1)

README.md +48 -0

@@ -10,4 +10,52 @@ pinned: false
 short_description: 'r/technology, r/gaming, r/programming etc search comments '
 ---
+# Reddit Semantic Search (Prototype)
+A lightweight semantic search engine built on Reddit comments using:
+- **Word2Vec embeddings** (trained from scratch on selected subreddits)
+- **FAISS** for fast vector indexing and retrieval
+- **Gradio** for a user-friendly, Reddit-themed interface
+> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.
+---
+## Dataset
+- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
+- Subreddits used:
+  - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
+- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.
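
The chunking step can be illustrated without Spark. The following is a minimal pure-Python sketch, not the project's PySpark code: it groups every 5 comments into one chunk (the size given in the pipeline description) and applies a simple tokenizer. The names `chunk_comments` and `tokenize`, and the cleaning rule itself, are assumptions made for illustration.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, List

CHUNK_SIZE = 5  # comments per chunk, as stated in the pipeline description


def chunk_comments(comments: Iterable[str], size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield one text chunk for every `size` consecutive comments.

    Works lazily, so it also accepts a streamed (iterator-style) source.
    The final chunk may hold fewer than `size` comments.
    """
    it = iter(comments)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield " ".join(batch)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric runs (hypothetical cleaning rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Toy input standing in for streamed Reddit comments.
comments = [f"comment number {i}" for i in range(12)]
chunks = list(chunk_comments(comments))
tokenized_chunks = [tokenize(chunk) for chunk in chunks]
```

With 12 toy comments this produces three chunks (5, 5, and 2 comments). The token lists are in the shape that `gensim`'s Word2Vec accepts as training sentences.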
+---
+## Project Pipeline
+1. **Data Loading & Chunking**
+   - Load subreddit splits individually using streaming
+   - Group every 5 comments into a single text chunk using PySpark
+   - Clean and tokenize text for training
+2. **Training Word2Vec**
+   - Custom embeddings trained using `gensim`'s Word2Vec on cleaned comment chunks
+3. **Vector Indexing (FAISS)**
+   - Each chunk is embedded by averaging the Word2Vec vectors of its words
+   - Dense vectors are indexed using `faiss.IndexFlatL2` (exact, exhaustive L2 search)
+4. **Semantic Search App (Gradio)**
+   - Enter your query and select a subreddit filter
+   - Retrieves the 5 most semantically similar comment chunks
+   - Reranking logic is not included yet and can be added later
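
Steps 3 and 4 can be sketched in plain Python to show what the index computes. This is an illustrative stand-in, not the app's code: the tiny `WORD_VECTORS` table replaces a trained Word2Vec model, and the brute-force `search` function mimics the exhaustive squared-L2 ranking that `faiss.IndexFlatL2` performs. All names and vector values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]

# Hypothetical 2-dimensional word vectors standing in for a trained Word2Vec model.
WORD_VECTORS: Dict[str, Vector] = {
    "gpu": [1.0, 0.0],
    "driver": [0.8, 0.2],
    "python": [0.0, 1.0],
    "loop": [0.2, 0.8],
}


def embed(tokens: List[str]) -> Optional[Vector]:
    """Average the vectors of in-vocabulary tokens; None if none are known."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query_tokens: List[str],
           index: List[Tuple[int, Vector]],
           k: int = 5) -> List[Tuple[float, int]]:
    """Exhaustive search: return up to k (distance, chunk_id) pairs, nearest first."""
    query = embed(query_tokens)
    if query is None:
        return []
    scored = [(squared_l2(query, vec), chunk_id) for chunk_id, vec in index]
    return sorted(scored)[:k]


# Toy tokenized chunks keyed by chunk id.
chunks = {0: ["gpu", "driver"], 1: ["python", "loop"], 2: ["gpu", "python"]}
index = [(chunk_id, embed(tokens)) for chunk_id, tokens in chunks.items()]
results = search(["gpu"], index, k=2)
```

Here the query `["gpu"]` ranks chunk 0 first and chunk 2 second. Averaging discards word order, which keeps the embedding cheap but is one reason a later reranking step could help.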
+---
+## Run the App
+```bash
+pip install -r requirements.txt
+python app.py  # or run the notebook
+```
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference