GenAIDevTOProd committed
Commit a03beaf · verified · 1 Parent(s): aaad046

Update README.md

Files changed (1)
  1. README.md +48 -0
README.md CHANGED
@@ -10,4 +10,52 @@ pinned: false
  short_description: 'r/technology, r/gaming, r/programming etc search comments '
  ---
 
+ # Reddit Semantic Search (Prototype)
+
+ A lightweight semantic search engine built on Reddit comments using:
+ - **Word2Vec embeddings** (trained from scratch on selected subreddits)
+ - **FAISS** for fast vector indexing and retrieval
+ - **Gradio** for a user-friendly, Reddit-themed interface
+
+ > ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.
+
+ ---
+
+ ## Dataset
+
+ - Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
+ - Subreddits used:
+   - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
+ - Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.
+
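The streaming load described in the Dataset section might look roughly like the sketch below. It assumes each subreddit is exposed as a named split (as the pipeline's "load subreddit splits individually" suggests) and that the comment text lives in a `body` field; the field name and both helper functions are illustrative assumptions, not taken from the project code.

```python
from itertools import islice

# Subreddits listed in the README.
SUBREDDITS = ["askscience", "gaming", "technology", "todayilearned", "programming"]


def stream_split(subreddit):
    """Return a lazily streamed dataset for one subreddit split."""
    # Imported lazily so take_bodies() below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceGECLM/REDDIT_comments",
        split=subreddit,   # assumption: one split per subreddit
        streaming=True,    # iterate rows without downloading the full split
    )


def take_bodies(rows, limit, field="body"):
    """Collect up to `limit` non-empty comment texts from an iterable of row dicts."""
    # `field="body"` is an assumed column name.
    bodies = (row[field] for row in rows if row.get(field))
    return list(islice(bodies, limit))
```

With network access and `datasets` installed, `take_bodies(stream_split("askscience"), 1000)` would pull the first 1000 non-empty comments from that split without materializing the rest.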
+ ---
+
+ ## Project Pipeline
+
+ 1. **Data Loading & Chunking**
+    - Load each subreddit split individually in streaming mode
+    - Group every 5 comments into a single text chunk using PySpark
+    - Clean and tokenize the text for training
+
+ 2. **Training Word2Vec**
+    - Custom embeddings are trained with `gensim`'s Word2Vec on the cleaned comment chunks
+
+ 3. **Vector Indexing (FAISS)**
+    - Each chunk is embedded by averaging the Word2Vec vectors of its words
+    - The dense vectors are indexed using `faiss.IndexFlatL2`
+
+ 4. **Semantic Search App (Gradio)**
+    - Enter a query and select a subreddit filter
+    - The app retrieves the top 5 most semantically similar comment chunks
+    - Reranking is not built in yet and can be added later
+
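The pipeline steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the project's code: `itertools` stands in for the PySpark grouping, a hand-written table of toy 2-d vectors stands in for the trained `gensim` Word2Vec vocabulary, and a brute-force loop stands in for `faiss.IndexFlatL2` (which likewise performs exact L2 search). The tokenization rule and all function names are hypothetical.

```python
import math
import re
from itertools import islice

CHUNK_SIZE = 5  # comments per chunk, as in step 1


def chunk_comments(comments, size=CHUNK_SIZE):
    """Join every `size` consecutive comments into one text chunk."""
    it = iter(comments)
    while batch := list(islice(it, size)):
        yield " ".join(batch)


def tokenize(text):
    """Lowercase and keep simple word tokens (an assumed cleaning rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def embed(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none match."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def search(query_vec, chunk_vecs, k=5):
    """Exact L2 search: indices of the k nearest chunks, nearest first."""
    dists = [(math.dist(query_vec, v) ** 2, i) for i, v in enumerate(chunk_vecs)]
    return [i for _, i in sorted(dists)[:k]]


# Toy 2-d vectors standing in for a trained Word2Vec vocabulary.
toy_vectors = {
    "gpu": [1.0, 0.0],
    "game": [0.9, 0.1],
    "python": [0.0, 1.0],
    "code": [0.1, 0.9],
}
comments = [
    "New GPU out", "This game rocks", "GPU prices", "Game of the year",
    "Best GPU", "Python tips", "Clean code",
]

chunks = list(chunk_comments(comments))
chunk_vecs = [embed(tokenize(c), toy_vectors, dim=2) for c in chunks]
query_vec = embed(tokenize("python code"), toy_vectors, dim=2)
best = search(query_vec, chunk_vecs, k=1)
print(chunks[best[0]])
```

Averaging word vectors ignores word order and gives out-of-vocabulary words no weight, so a query made only of unseen words maps to the zero vector here; a real app would need to decide how to handle that case.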
+ ---
+
+ ## Run the App
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py  # or run the notebook
+ ```
+
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference