|
# 🧠 Text Similarity Model using Sentence-BERT |
|
|
|
This project fine-tunes a Sentence-BERT model (`paraphrase-MiniLM-L6-v2`) on the **STS Benchmark** English dataset (`stsb_multi_mt`) to perform **semantic similarity scoring** between two text inputs. |
|
|
|
--- |
|
|
|
## 🚀 Features |
|
|
|
- 🔁 Fine-tunes `sentence-transformers/paraphrase-MiniLM-L6-v2` |
|
- 🔧 Trained on the `stsb_multi_mt` dataset (English split) |
|
- 🧪 Predicts a cosine-similarity score for sentence pairs (targets normalized to 0–1)
|
- ⚙️ Uses a custom PyTorch model and manual training loop |
|
- 💾 Model is saved as `similarity_model.pt` |
|
- 🧠 Supports inference on custom sentence pairs |
|
|
|
--- |
|
|
|
## 📦 Dependencies |
|
|
|
Install required libraries: |
|
|
|
```bash
pip install -q transformers datasets sentence-transformers evaluate --upgrade
```
|
|
|
## 📊 Dataset

- Dataset: `stsb_multi_mt`

- Config: `en` (English)

- Purpose: provides sentence pairs labeled with similarity scores from 0 to 5, which are normalized to 0–1 for training (see the snippet after the code block below)
|
|
|
```python
from datasets import load_dataset

dataset = load_dataset("stsb_multi_mt", name="en", split="train")
dataset = dataset.shuffle(seed=42).select(range(10000))  # Sample a subset for faster training
```
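
The raw labels range from 0 to 5, so they must be scaled to the 0–1 training range. A one-line sketch, assuming the dataset's score column is named `similarity_score`:

```python
# Scale gold scores from [0, 5] to [0, 1] to match the cosine-similarity target range.
dataset = dataset.map(lambda ex: {"similarity_score": ex["similarity_score"] / 5.0})
```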
|
|
|
## 🏗️ Model Architecture |
|
### ✅ Base Model

- `sentence-transformers/paraphrase-MiniLM-L6-v2` (from Hugging Face)
|
|
|
### ✅ Fine-Tuning

- Cosine similarity computed between the [CLS] token embeddings of the two inputs

- Loss: mean squared error (MSE) between the predicted similarity and the true score (see the sketch below)
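
A minimal sketch of this architecture, assuming a small wrapper class (the `SimilarityModel` name and exact layout are illustrative, not necessarily the project's code):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SimilarityModel(nn.Module):
    """Encodes two inputs and scores them with cosine similarity."""

    def __init__(self, model_name="sentence-transformers/paraphrase-MiniLM-L6-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)

    def encode(self, inputs):
        # Use the [CLS] token embedding: first position of the last hidden state.
        outputs = self.encoder(**inputs)
        return outputs.last_hidden_state[:, 0]

    def forward(self, inputs_a, inputs_b):
        emb_a = self.encode(inputs_a)
        emb_b = self.encode(inputs_b)
        # One similarity score per pair, trained against the normalized gold score.
        return torch.cosine_similarity(emb_a, emb_b, dim=-1)
```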
|
|
|
## 🧠 Training

- Epochs: 3

- Optimizer: Adam

- Loss: `MSELoss`

- Manual training loop in PyTorch (sketched below)
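
A condensed sketch of the manual loop, assuming the illustrative `SimilarityModel` class above and the dataset's `sentence1`/`sentence2`/`similarity_score` columns (batch size and learning rate are placeholders, not the project's exact settings):

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
model = SimilarityModel()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = torch.nn.MSELoss()

def collate(examples):
    # Tokenize each side of the pair separately; stack the (normalized) gold scores.
    a = tokenizer([e["sentence1"] for e in examples], padding=True, truncation=True, return_tensors="pt")
    b = tokenizer([e["sentence2"] for e in examples], padding=True, truncation=True, return_tensors="pt")
    scores = torch.tensor([e["similarity_score"] for e in examples], dtype=torch.float)
    return a, b, scores

loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch_a, batch_b, scores in loader:
        optimizer.zero_grad()
        preds = model(batch_a, batch_b)
        loss_fn(preds, scores).backward()
        optimizer.step()

torch.save(model.state_dict(), "similarity_model.pt")
```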
|
|
|
## 📁 Files and Structure

```
📦 text-similarity-project
 ┣ 📜 similarity_model.pt   # Trained PyTorch model
 ┣ 📜 training_script.py    # Full training and inference script
 ┗ 📜 README.md             # Documentation
```
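
## 🔍 Inference

Once trained, the saved weights can be reloaded to score custom sentence pairs. A minimal sketch, reusing the illustrative `SimilarityModel` and `tokenizer` from the snippets above:

```python
model = SimilarityModel()
model.load_state_dict(torch.load("similarity_model.pt"))
model.eval()

sent1 = "A man is playing a guitar."
sent2 = "Someone is strumming an instrument."
with torch.no_grad():
    a = tokenizer([sent1], truncation=True, return_tensors="pt")
    b = tokenizer([sent2], truncation=True, return_tensors="pt")
    score = model(a, b).item()

print(f"Similarity: {score:.3f}")  # Closer to 1 means more similar meaning
```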
|
|