---
license: mit
datasets:
- nvidia/HelpSteer2
language:
- en
metrics:
- accuracy
- f1
- recall
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
new_version: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection
pipeline_tag: text-classification
library_name: transformers
tags:
- legal
- plagiarism-detection
---
# SmolLM Fine-Tuned for Plagiarism Detection

This repository hosts a fine-tuned version of SmolLM (135M parameters) that detects plagiarism by classifying sentence pairs as plagiarized or non-plagiarized. Fine-tuning was performed on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to improve the model's ability to identify textual similarity.

## Model Information

-   **Base Model**: HuggingFaceTB/SmolLM2-135M-Instruct
-   **Fine-tuned Model Name**: `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection`
-   **License**: MIT
-   **Language**: English
-   **Task**: Text Classification
-   **Metrics**: Accuracy, F1 Score, Recall

## Dataset

The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides pairs of sentences labeled to indicate whether one is a rephrased version of the other (i.e., plagiarized). The dataset is well suited to sentence-level similarity detection, and its labels (`1` for plagiarized, `0` for non-plagiarized) make it straightforward to train a binary classifier.
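For illustration, rows in this format can be pictured as follows (the sentences below are hypothetical stand-ins, not drawn from the dataset):

```python
# Each row pairs two sentences with a binary label (hypothetical examples)
pairs = [
    ("Solar power is a renewable energy source.",
     "Energy from the sun is a renewable resource.", 1),  # rephrased: plagiarized
    ("Solar power is a renewable energy source.",
     "The committee meets every Tuesday.", 0),            # unrelated: non-plagiarized
]
```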

## Training Procedure

Fine-tuning was performed with the Hugging Face `transformers` library. Key details include:

-   **Model Architecture**: The model was modified for sequence classification with two output labels.
-   **Optimizer**: AdamW was used to handle optimization, with a learning rate of 2e-5.
-   **Loss Function**: Cross-Entropy Loss was used as the objective function.
-   **Batch Size**: Set to 16 for memory and performance balance.
-   **Epochs**: Trained for 3 epochs.
-   **Padding**: A custom padding token was added (SmolLM does not define one by default), ensuring smooth batch tokenization.

Training used a DataLoader that fed tokenized sentence pairs (with attention masks, truncation, and padding) into the model. After three epochs, the model reached an accuracy of around 99.66% on the training set. A minimal sketch of this setup is shown below.
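The sketch is a reconstruction of a standard PyTorch fine-tuning loop under the hyperparameters listed above, not the original training script; the inline `train_pairs` rows are hypothetical stand-ins for the real dataset, and exact preprocessing may differ.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Add a custom padding token if the tokenizer does not define one (see above)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Adapt the base model for sequence classification with two output labels
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# Hypothetical stand-in rows; the real dataset supplies (sentence_1, sentence_2, label)
train_pairs = [
    ("The cat sat on the mat.", "A cat was sitting on the mat.", 1),
    ("The cat sat on the mat.", "Interest rates rose last quarter.", 0),
]

def collate(batch):
    s1, s2, labels = zip(*batch)
    # Encode each sentence pair with truncation, padding, and attention masks
    enc = tokenizer(list(s1), list(s2), truncation=True, padding=True,
                    return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        outputs = model(**batch)  # cross-entropy loss over the two labels
        outputs.loss.backward()
        optimizer.step()
```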

## Usage

This model can be used directly with the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Load the model and tokenizer from the `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection` repository, provide sentence pairs as input, and interpret the output logits to determine whether plagiarism is detected.
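A minimal inference sketch (the example sentences are hypothetical, and the label mapping of `1` = plagiarized follows the Dataset section):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

sentence_1 = "The experiment confirmed the hypothesis."
sentence_2 = "The hypothesis was confirmed by the experiment."

# Encode the pair as one sequence; the model scores the pair jointly
inputs = tokenizer(sentence_1, sentence_2, return_tensors="pt",
                   truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Label 1 = plagiarized, 0 = non-plagiarized (see the Dataset section)
prediction = logits.argmax(dim=-1).item()
print("Plagiarized" if prediction == 1 else "Not plagiarized")
```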

## Evaluation

During evaluation, the model performed robustly with the following metrics:

-   **Accuracy**: approximately **99.66%** on the training set; **100%** on the test set
-   **F1 Score**: **1.0**
-   **Recall**: **1.0**

## Model and Tokenizer Saving

Upon completion of fine-tuning, the model and tokenizer were saved for deployment and ease of loading in future projects. They can be loaded from Hugging Face or saved locally for custom applications.
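Continuing from the training sketch above, a typical save/reload round-trip uses the standard `save_pretrained` / `from_pretrained` API (the local directory name here is arbitrary):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

save_dir = "./smolLM-plagiarism-detector"  # any local path
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload later, from the local directory or from the Hub
model = AutoModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```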

## License

This model and associated code are released under the MIT License, allowing for both personal and commercial use.

### Connect with Me

I appreciate your support and am happy to connect!  
[GitHub](https://github.com/Jatin-Mehra119) | [Email]([email protected]) | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/)