# 🧠 Text Similarity Model using Sentence-BERT

This project fine-tunes a Sentence-BERT model (`paraphrase-MiniLM-L6-v2`) on the **STS Benchmark** English dataset (`stsb_multi_mt`) to perform **semantic similarity scoring** between two text inputs.

---

## 🚀 Features

- 🔁 Fine-tunes `sentence-transformers/paraphrase-MiniLM-L6-v2`
- 🔧 Trained on the `stsb_multi_mt` dataset (English split)
- 🧪 Predicts cosine similarity between sentence pairs (0 to 1)
- ⚙️ Uses a custom PyTorch model and manual training loop
- 💾 Model is saved as `similarity_model.pt`
- 🧠 Supports inference on custom sentence pairs

---

## 📦 Dependencies

Install required libraries:

```bash
pip install -q transformers datasets sentence-transformers evaluate --upgrade
```

## 📊 Dataset

 - Dataset: `stsb_multi_mt`
 - Language config: `en` (English), `train` split
 - Purpose: provides sentence pairs with similarity scores ranging from 0 to 5, which are normalized to 0–1 for training (see the snippet after the loading code)

```python
from datasets import load_dataset

dataset = load_dataset("stsb_multi_mt", name="en", split="train")
dataset = dataset.shuffle(seed=42).select(range(10000))  # Sample subset for faster training
```
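
The raw `similarity_score` values run from 0 to 5 and are scaled into the 0–1 range before training. A minimal sketch of that step (the column name follows the `stsb_multi_mt` schema; the exact code in `training_script.py` may differ):

```python
# Scale the 0-5 gold scores into the 0-1 range used as training targets.
dataset = dataset.map(lambda ex: {"label": ex["similarity_score"] / 5.0})
```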

## 🏗️ Model Architecture
### ✅ Base Model

 - `sentence-transformers/paraphrase-MiniLM-L6-v2` (from Hugging Face)

### ✅ Fine-Tuning

 - Cosine similarity is computed between the [CLS] token embeddings of the two inputs (see the sketch below)
 - Loss: Mean Squared Error (MSE) between the predicted similarity and the normalized gold score
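
A minimal sketch of how such a model can be put together (class and variable names here are illustrative, not necessarily those used in `training_script.py`):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/paraphrase-MiniLM-L6-v2"

class SimilarityModel(nn.Module):
    """Encodes two inputs with a shared transformer and returns their cosine similarity."""

    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)

    def encode(self, input_ids, attention_mask):
        # Take the [CLS] token embedding as the sentence representation.
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.last_hidden_state[:, 0, :]

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        emb_a = self.encode(ids_a, mask_a)
        emb_b = self.encode(ids_b, mask_b)
        return torch.nn.functional.cosine_similarity(emb_a, emb_b)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = SimilarityModel()
```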

## 🧠 Training

 - Epochs: 3
 - Optimizer: Adam
 - Loss: `MSELoss`
 - Manual training loop written in PyTorch (sketched below)
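
A condensed sketch of what that loop might look like, reusing `model`, `tokenizer`, and the normalized `label` column from the snippets above (batch size and learning rate are illustrative assumptions; the actual values live in `training_script.py`):

```python
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # learning rate is illustrative
loss_fn = torch.nn.MSELoss()
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model.train()
for epoch in range(3):
    for batch in loader:
        # Tokenize both sides of each sentence pair.
        enc_a = tokenizer(batch["sentence1"], padding=True, truncation=True, return_tensors="pt")
        enc_b = tokenizer(batch["sentence2"], padding=True, truncation=True, return_tensors="pt")
        labels = torch.as_tensor(batch["label"], dtype=torch.float32)

        preds = model(enc_a["input_ids"], enc_a["attention_mask"],
                      enc_b["input_ids"], enc_b["attention_mask"])
        loss = loss_fn(preds, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Saving the state dict is an assumption; the script may persist the model differently.
torch.save(model.state_dict(), "similarity_model.pt")
```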

## 📁 Files and Structure

📦text-similarity-project
 ┣ 📜similarity_model.pt          # Trained PyTorch model
 ┣ 📜training_script.py           # Full training and inference script
 ┗ 📜README.md                    # Documentation
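
## 🔍 Inference

The saved weights can be loaded back to score custom sentence pairs. This is a sketch built on the `SimilarityModel` and `tokenizer` defined above, not a verbatim excerpt from `training_script.py`:

```python
import torch

# Restore the trained weights (assumes the state dict was saved as shown above).
model = SimilarityModel()
model.load_state_dict(torch.load("similarity_model.pt", map_location="cpu"))
model.eval()

sentence_a = "A man is playing a guitar."
sentence_b = "Someone is strumming an instrument."

enc_a = tokenizer(sentence_a, return_tensors="pt", truncation=True)
enc_b = tokenizer(sentence_b, return_tensors="pt", truncation=True)

with torch.no_grad():
    score = model(enc_a["input_ids"], enc_a["attention_mask"],
                  enc_b["input_ids"], enc_b["attention_mask"])

print(f"Similarity: {score.item():.3f}")
```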