# 🧠 Text Similarity Model using Sentence-BERT
This project fine-tunes a Sentence-BERT model (`paraphrase-MiniLM-L6-v2`) on the **STS Benchmark** English dataset (`stsb_multi_mt`) to perform **semantic similarity scoring** between two text inputs.
---
## 🚀 Features
- 🔁 Fine-tunes `sentence-transformers/paraphrase-MiniLM-L6-v2`
- 🔧 Trained on the `stsb_multi_mt` dataset (English split)
- 🧪 Scores sentence pairs by cosine similarity, trained against targets normalized to 0–1
- ⚙️ Uses a custom PyTorch model and manual training loop
- 💾 Model is saved as `similarity_model.pt`
- 🧠 Supports inference on custom sentence pairs
---
## 📦 Dependencies
Install required libraries:
```bash
pip install -q transformers datasets sentence-transformers evaluate --upgrade
```
## 📊 Dataset
- Dataset: `stsb_multi_mt`
- Configuration: `en` (English), `train` split
- Purpose: Provides sentence pairs with similarity scores ranging from 0 to 5, which are normalized to 0–1 for training.
```python
from datasets import load_dataset
dataset = load_dataset("stsb_multi_mt", name="en", split="train")
# Shuffle, then cap at 10,000 rows; the cap is clamped to the split size so select() cannot go out of range
dataset = dataset.shuffle(seed=42).select(range(min(10000, len(dataset))))
```
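The normalization from 0–5 to 0–1 described above amounts to dividing each score by 5. The following is a minimal sketch, not the project's actual code: the helper names `normalize_score` and `normalize_example` and the output key `label` are hypothetical, and the column name `similarity_score` is an assumption about the dataset schema.

```python
def normalize_score(score: float, max_score: float = 5.0) -> float:
    """Map a raw STS score in [0, max_score] to [0, 1], clamping out-of-range values."""
    return min(max(score / max_score, 0.0), 1.0)


def normalize_example(example: dict) -> dict:
    # "similarity_score" is the assumed name of the raw score column;
    # "label" is a hypothetical name for the normalized training target.
    return {**example, "label": normalize_score(example["similarity_score"])}


# With the `datasets` library this could be applied row by row:
# dataset = dataset.map(normalize_example)
```

Keeping the targets in 0–1 puts them on a scale comparable to the cosine similarity the model outputs.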
## 🏗️ Model Architecture
### ✅ Base Model
- `sentence-transformers/paraphrase-MiniLM-L6-v2` (from Hugging Face)
### ✅ Fine-Tuning
- Cosine similarity computed between the CLS token embeddings of two inputs
- Loss: Mean Squared Error (MSE) between predicted similarity and true score
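
The fine-tuning setup above could be sketched roughly as follows. This is an illustration under assumptions, not the contents of `training_script.py`: the class name `SimilarityModel` and the function `similarity_loss` are hypothetical, and the encoder is assumed to return an object with a `last_hidden_state` tensor, as Hugging Face `AutoModel` encoders do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityModel(nn.Module):
    """Scores a sentence pair by the cosine similarity of its CLS embeddings."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        # Assumed: encoder(input_ids=..., attention_mask=...) returns an
        # object exposing `last_hidden_state` of shape (batch, seq_len, hidden).
        self.encoder = encoder

    def embed(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # The first position of each sequence holds the CLS token embedding.
        return outputs.last_hidden_state[:, 0]

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        emb_a = self.embed(ids_a, mask_a)
        emb_b = self.embed(ids_b, mask_b)
        # One similarity value per pair in the batch.
        return F.cosine_similarity(emb_a, emb_b, dim=1)


def similarity_loss(predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predicted similarity and the normalized score."""
    return F.mse_loss(predicted, target)
```

An encoder could be supplied with `AutoModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")` from `transformers`. Note that raw cosine similarity can be negative, while the normalized targets lie in 0–1.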
## 🧠 Training
- Epochs: 3
- Optimizer: Adam
- Loss: MSELoss
- Manual training loop using PyTorch
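
A manual loop with these settings might look like the sketch below. The function name `train`, the batch layout (a list of `(inputs, targets)` pairs), and the learning rate of `2e-5` are assumptions for illustration; the source states only the epoch count, optimizer, and loss.

```python
import torch
import torch.nn as nn


def train(model: nn.Module, batches: list, epochs: int = 3, lr: float = 2e-5) -> list:
    """Run a plain PyTorch training loop and return the mean loss per epoch.

    `batches` is assumed to be a list of (inputs, targets) pairs, where
    `inputs` is a tuple of tensors passed positionally to the model.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for inputs, targets in batches:
            optimizer.zero_grad()            # clear gradients from the previous step
            predictions = model(*inputs)     # forward pass
            loss = loss_fn(predictions, targets)
            loss.backward()                  # backpropagate
            optimizer.step()                 # update parameters
            total += loss.item()
        history.append(total / len(batches))
    return history


# After training, the weights could be written to the file named in this README:
# torch.save(model.state_dict(), "similarity_model.pt")
```

Returning the per-epoch mean loss makes it easy to check that training is converging before saving the model.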
## Files and Structure
📦text-similarity-project
┣ 📜similarity_model.pt # Trained PyTorch model
┣ 📜training_script.py # Full training and inference script
┗ 📜README.md # Documentation