---
license: mit
datasets:
- nvidia/HelpSteer2
language:
- en
metrics:
- accuracy
- f1
- recall
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
new_version: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection
pipeline_tag: text-classification
library_name: transformers
tags:
- legal
- plagiarism-detection
---
# SmolLM Fine-Tuned for Plagiarism Detection
This repository hosts a fine-tuned version of SmolLM2 (135M parameters) that detects plagiarism by classifying sentence pairs as plagiarized or non-plagiarized. Fine-tuning was performed on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to improve the model's accuracy at identifying textual similarity.
## Model Information
- **Base Model**: HuggingFaceTB/SmolLM2-135M-Instruct
- **Fine-tuned Model Name**: `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection`
- **License**: MIT
- **Language**: English
- **Task**: Text Classification
- **Metrics**: Accuracy, F1 Score, Recall
## Dataset
The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides pairs of sentences labeled to indicate whether one is a rephrased version of the other (i.e., plagiarized). The dataset is well suited for sentence-level similarity detection, and its labels (`1` for plagiarized, `0` for non-plagiarized) make this a straightforward binary classification task.
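For illustration, a minimal loading sketch. The file name and column names below are assumptions for the example, not guaranteed to match the actual Kaggle release:

```python
import pandas as pd

# Hypothetical file and column names; check the Kaggle page for the real layout.
df = pd.read_csv("mit_plagiarism.csv")  # columns: sentence1, sentence2, label
data = list(zip(df["sentence1"], df["sentence2"], df["label"]))  # label: 1 = plagiarized, 0 = not
```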
## Training Procedure
The fine-tuning was done using the `transformers` library from Hugging Face. Key details include:
- **Model Architecture**: The model was modified for sequence classification with two output labels.
- **Optimizer**: AdamW with a learning rate of 2e-5.
- **Loss Function**: Cross-entropy loss.
- **Batch Size**: 16, balancing memory use and throughput.
- **Epochs**: Trained for 3 epochs.
- **Padding**: A custom padding token was added to align with SmolLM’s requirements, ensuring smooth tokenization.
Training used a DataLoader that fed tokenized sentence pairs (with attention masks, truncation, and padding) into the model. After training, the model reached approximately 99.66% accuracy on the training set.
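A minimal sketch of this setup, using the hyperparameters listed above. This is an illustration of a generic `transformers` fine-tuning loop, not the author's exact script; `data` is assumed to come from the dataset-loading sketch above:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Per the card, a custom padding token was added; resize embeddings to match.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def collate(batch):
    # Tokenize sentence pairs with truncation, padding, and attention masks.
    s1, s2, y = zip(*batch)
    enc = tokenizer(list(s1), list(s2), truncation=True,
                    padding=True, return_tensors="pt")
    enc["labels"] = torch.tensor(y)
    return enc

loader = DataLoader(data, batch_size=16, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch in loader:
        out = model(**batch)  # cross-entropy loss is computed internally
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```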
## Usage
This model can be used directly with the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Load the model and tokenizer from the `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection` repository and pass sentence pairs as inputs; the output logits indicate whether plagiarism is detected.
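For example (a minimal sketch; the example sentences are arbitrary, and the label mapping follows the dataset description above):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

sentence1 = "The quick brown fox jumps over the lazy dog."
sentence2 = "A fast brown fox leaps over a sleepy dog."

inputs = tokenizer(sentence1, sentence2, truncation=True,
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print("plagiarized" if pred == 1 else "non-plagiarized")
```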
## Evaluation
During evaluation, the model performed robustly with the following metrics:
- **Accuracy**: ~99.66% on the training set; 100% on the test set
- **F1 Score**: 1.0
- **Recall**: 1.0
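A sketch of how such metrics can be computed with scikit-learn, assuming `labels` (ground-truth 0/1 values) and `preds` (argmax'd model outputs) from an evaluation pass:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# `labels` and `preds` are assumed to come from a held-out evaluation pass.
print("accuracy:", accuracy_score(labels, preds))
print("f1:      ", f1_score(labels, preds))
print("recall:  ", recall_score(labels, preds))
```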
## Model and Tokenizer Saving
Upon completion of fine-tuning, the model and tokenizer were saved for deployment and ease of loading in future projects. They can be loaded from Hugging Face or saved locally for custom applications.
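In `transformers` this is the usual `save_pretrained`/`from_pretrained` round trip; the local directory name below is an arbitrary example:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Save locally after fine-tuning...
model.save_pretrained("./smolLM-plagiarism-detector")
tokenizer.save_pretrained("./smolLM-plagiarism-detector")

# ...and reload later, from the local directory or the Hub.
model = AutoModelForSequenceClassification.from_pretrained("./smolLM-plagiarism-detector")
tokenizer = AutoTokenizer.from_pretrained("./smolLM-plagiarism-detector")
```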
## License
This model and associated code are released under the MIT License, allowing for both personal and commercial use.
### Connect with Me
I appreciate your support and am happy to connect!
[GitHub](https://github.com/Jatin-Mehra119) | [Email]([email protected]) | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/)