SmolLM Fine-Tuned for Plagiarism Detection

This repository hosts a fine-tuned version of SmolLM (135M Parameters) for detecting plagiarism by classifying sentence pairs as either plagiarized or non-plagiarized. Fine-tuning was performed on the MIT Plagiarism Detection Dataset to enhance the model’s accuracy and performance in identifying textual similarities.

Model Information

Base Model: HuggingFaceTB/SmolLM2-135M-Instruct
Fine-tuned Model Name: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection
License: MIT
Language: English
Task: Text Classification
Metrics: Accuracy, F1 Score, Recall

Dataset

The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides pairs of sentences labeled to indicate whether one is a rephrased version of the other (i.e., plagiarized). This dataset is suited for sentence-level similarity detection, and the labels (1 for plagiarized and 0 for non-plagiarized) offer a straightforward approach to training for binary classification.

Training Procedure

The fine-tuning was done using the transformers library from Hugging Face. Key details include:

Model Architecture: The model was modified for sequence classification with two output labels.
Optimizer: AdamW was used to handle optimization, with a learning rate of 2e-5.
Loss Function: Cross-Entropy Loss was used as the objective function.
Batch Size: Set to 16 for memory and performance balance.
Epochs: Trained for 3 epochs.
Padding: A custom padding token was added to align with SmolLM’s requirements, ensuring smooth tokenization.

Training involved a DataLoader that fed sentence pairs into the model, tokenized with attention masking, truncation, and padding. After training, the model achieved a high accuracy score, around 99.66% on the training dataset.

Usage

This model can be employed directly within the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Simply load the model and tokenizer from the jatinmehra/smolLM-fine-tuned-for-plagiarism-detection repository, and provide sentence pairs as inputs. The model’s output logits can be interpreted to determine whether plagiarism is detected.

Example:

from transformers import GPT2Tokenizer, LlamaForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection")

model = LlamaForSequenceClassification.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection", num_labels=2)

model.eval()

Evaluation

During evaluation, the model performed robustly with the following metrics:

Accuracy on Validation set: 96%

Classification Report On Test Set

Accuracy: 96.20%

Class	Precision	Recall	F1-Score	Support
0	0.96	0.97	0.96	36,586
1	0.97	0.96	0.96	36,888

Overall Metrics:

Accuracy: 0.96
Macro Average:
- Precision: 0.96
- Recall: 0.96
- F1-Score: 0.96
Weighted Average:
- Precision: 0.96
- Recall: 0.96
- F1-Score: 0.96
Total Support: 73,474

Model and Tokenizer Saving

Upon completion of fine-tuning, the model and tokenizer were saved for deployment and ease of loading in future projects. They can be loaded from Hugging Face or saved locally for custom applications.

License

This model and associated code are released under the MIT License, allowing for both personal and commercial use.

Connect with Me

I appreciate your support and am happy to connect!
GitHub | Email | LinkedIn | Portfolio

jatinmehra
/

smolLM-fined-tuned-for-PLAGAIRISM_Detection