# 🧠 Text Similarity Model using Sentence-BERT
This project fine-tunes a Sentence-BERT model (`paraphrase-MiniLM-L6-v2`) on the **STS Benchmark** English dataset (`stsb_multi_mt`) to perform **semantic similarity scoring** between two text inputs.
---
## 🚀 Features
- 🔁 Fine-tunes `sentence-transformers/paraphrase-MiniLM-L6-v2`
- 🔧 Trained on the `stsb_multi_mt` dataset (English split)
- 🧪 Predicts cosine similarity between sentence pairs (0 to 1)
- ⚙️ Uses a custom PyTorch model and manual training loop
- 💾 Model is saved as `similarity_model.pt`
- 🧠 Supports inference on custom sentence pairs
---
## 📦 Dependencies
Install required libraries:
```bash
pip install -q transformers datasets sentence-transformers evaluate --upgrade
```
## 📊 Dataset
- Dataset: `stsb_multi_mt` (Hugging Face)
- Configuration: `en` (English), split: `train`
- Purpose: provides sentence pairs with similarity scores from 0 to 5, which are normalized to 0–1 for training (see the normalization sketch below)
```python
from datasets import load_dataset
dataset = load_dataset("stsb_multi_mt", name="en", split="train")
dataset = dataset.shuffle(seed=42).select(range(10000)) # Sample subset for faster training
```
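In `stsb_multi_mt` the gold score is stored in the `similarity_score` column on a 0–5 scale. A minimal normalization sketch (the exact step in `training_script.py` may differ):
```python
# Scale the 0–5 gold scores down to the 0–1 range used as training targets.
def normalize_score(example):
    example["similarity_score"] = example["similarity_score"] / 5.0
    return example

dataset = dataset.map(normalize_score)
```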
## 🏗️ Model Architecture
### ✅ Base Model
- `sentence-transformers/paraphrase-MiniLM-L6-v2` (from Hugging Face)
### ✅ Fine-Tuning
- Cosine similarity computed between the CLS token embeddings of the two inputs (see the sketch after this list)
- Loss: Mean Squared Error (MSE) between the predicted similarity and the gold score
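The actual model class lives in `training_script.py`; the following is a minimal sketch of the architecture described above, with illustrative class and method names:
```python
import torch
import torch.nn as nn
from transformers import AutoModel

MODEL_NAME = "sentence-transformers/paraphrase-MiniLM-L6-v2"

class SimilarityModel(nn.Module):
    """Encodes two sentences and scores them with cosine similarity."""

    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)

    def encode(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # CLS token embedding: first token of the last hidden state
        return outputs.last_hidden_state[:, 0, :]

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        emb_a = self.encode(ids_a, mask_a)
        emb_b = self.encode(ids_b, mask_b)
        # Cosine similarity between the two sentence embeddings
        return torch.cosine_similarity(emb_a, emb_b, dim=-1)
```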
## 🧠 Training
- Epochs: 3
- Optimizer: Adam
- Loss: MSELoss
- Manual training loop using PyTorch (see the sketch after this list)
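A minimal sketch of such a manual loop, assuming the `SimilarityModel` class and the normalized `dataset` from above; the batch size and learning rate are illustrative assumptions, not values taken from the script:
```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = SimilarityModel()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # learning rate is an assumption
loss_fn = torch.nn.MSELoss()
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # batch size is an assumption

model.train()
for epoch in range(3):
    for batch in loader:
        # Tokenize each side of the sentence pair separately
        enc_a = tokenizer(batch["sentence1"], padding=True, truncation=True, return_tensors="pt")
        enc_b = tokenizer(batch["sentence2"], padding=True, truncation=True, return_tensors="pt")
        targets = torch.as_tensor(batch["similarity_score"], dtype=torch.float)

        preds = model(enc_a["input_ids"], enc_a["attention_mask"],
                      enc_b["input_ids"], enc_b["attention_mask"])
        loss = loss_fn(preds, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Save the fine-tuned weights under the name used by the project
torch.save(model.state_dict(), "similarity_model.pt")
```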
## Files and Structure
```text
📦 text-similarity-project
 ┣ 📜 similarity_model.pt   # Trained PyTorch model
 ┣ 📜 training_script.py    # Full training and inference script
 ┗ 📜 README.md             # Documentation
```
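Once trained, the saved weights can be loaded to score custom sentence pairs. A minimal inference sketch, assuming the `SimilarityModel` class and `MODEL_NAME` from the architecture sketch above, and that `similarity_model.pt` holds a `state_dict`:
```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = SimilarityModel()
model.load_state_dict(torch.load("similarity_model.pt", map_location="cpu"))
model.eval()

sentence_a = "A man is playing a guitar."
sentence_b = "Someone is playing an instrument."
enc_a = tokenizer(sentence_a, return_tensors="pt")
enc_b = tokenizer(sentence_b, return_tensors="pt")

with torch.no_grad():
    score = model(enc_a["input_ids"], enc_a["attention_mask"],
                  enc_b["input_ids"], enc_b["attention_mask"])

print(f"Similarity: {score.item():.3f}")
```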