---
license: mit
datasets:
- nvidia/HelpSteer2
language:
- en
metrics:
- accuracy
- f1
- recall
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
new_version: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection
pipeline_tag: text-classification
library_name: transformers
tags:
- legal
- plagiarism-detection
---
# SmolLM Fine-Tuned for Plagiarism Detection

This repository hosts a fine-tuned version of SmolLM2 (135M parameters) that detects plagiarism by classifying sentence pairs as plagiarized or non-plagiarized. Fine-tuning was performed on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to improve the model's accuracy at identifying textual similarity.
|
|
|
## Model Information
|
|
|
- **Base Model**: HuggingFaceTB/SmolLM2-135M-Instruct
- **Fine-Tuned Model Name**: `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection`
- **License**: MIT
- **Language**: English
- **Task**: Text Classification
- **Metrics**: Accuracy, F1 Score, Recall
|
|
|
## Dataset
|
|
|
The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides pairs of sentences labeled to indicate whether one is a rephrased version of the other (i.e., plagiarized). The dataset suits sentence-level similarity detection, and its labels (`1` for plagiarized, `0` for non-plagiarized) make training a binary classifier straightforward.
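
For illustration, each example pairs two sentences with a binary label. The field names below are hypothetical; the Kaggle CSV's actual column names may differ:

```python
# Hypothetical examples of the pair/label format (field names are illustrative).
rows = [
    {"sentence1": "The cat sat on the mat.",
     "sentence2": "A cat was sitting on the mat.",
     "label": 1},  # rephrased -> plagiarized
    {"sentence1": "The cat sat on the mat.",
     "sentence2": "Stock prices rose sharply on Monday.",
     "label": 0},  # unrelated -> non-plagiarized
]
```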
|
|
|
## Training Procedure
|
|
|
Fine-tuning was carried out with the Hugging Face `transformers` library. Key details:
|
|
|
- **Model Architecture**: The base model was adapted for sequence classification with two output labels.
- **Optimizer**: AdamW, with a learning rate of 2e-5.
- **Loss Function**: Cross-entropy loss.
- **Batch Size**: 16, chosen to balance memory use and throughput.
- **Epochs**: 3.
- **Padding**: A custom padding token was added to meet SmolLM's tokenizer requirements and ensure smooth tokenization.
|
|
|
Training used a DataLoader that fed tokenized sentence pairs to the model, with attention masking, truncation, and padding applied. After three epochs, the model reached roughly 99.66% accuracy on the training set.
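
The exact training script is not included in this card; the sketch below reproduces the setup described above (sequence-classification head with two labels, AdamW at 2e-5, batch size 16, 3 epochs, added padding token). The `train_pairs` list and the tokenization parameters such as `max_length` are assumptions, not the original code.

```python
# Minimal training sketch; `train_pairs` is an assumed list of
# (sentence1, sentence2, label) tuples loaded from the dataset.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Add a padding token if the tokenizer lacks one, and register it with the model.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

def collate(batch):
    # Tokenize each sentence pair with truncation, padding, and attention masks.
    s1, s2, labels = zip(*batch)
    enc = tokenizer(list(s1), list(s2), truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = AdamW(model.parameters(), lr=2e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # cross-entropy computed internally from labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```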
|
|
|
## Usage
|
|
|
This model can be employed directly within the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Simply load the model and tokenizer from the `jatinmehra/smolLM-fine-tuned-for-plagiarism-detection` repository, and provide sentence pairs as inputs. The model’s output logits can be interpreted to determine whether plagiarism is detected. |
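
A minimal sketch of that flow follows. The card does not specify whether pairs are passed as two-segment inputs (as here) or concatenated into one string, so treat the input format as an assumption and mirror your training setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

sentence1 = "The quick brown fox jumps over the lazy dog."
sentence2 = "A fast brown fox leaps over a sleepy dog."

# Encode the pair with truncation and padding, as in training.
inputs = tokenizer(sentence1, sentence2, truncation=True, padding=True,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Label 1 = plagiarized, label 0 = non-plagiarized.
print("plagiarized" if logits.argmax(dim=-1).item() == 1 else "non-plagiarized")
```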
|
|
|
## Evaluation
|
|
|
During evaluation, the model performed robustly, with the following metrics:
|
|
|
- **Accuracy**: approximately **99.66%** on the training set; **100%** on the test set
- **F1 Score**: **1.0**
- **Recall**: **1.0**
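
The evaluation code is not shown in this card; metrics like these can be computed along the following lines, where `y_true` and `y_pred` (hypothetical names) are the test-set labels and the model's argmax predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# y_true: gold labels for the test split; y_pred: model predictions (see Usage).
print(f"accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"f1:       {f1_score(y_true, y_pred):.4f}")
print(f"recall:   {recall_score(y_true, y_pred):.4f}")
```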
|
|
|
## Model and Tokenizer Saving
|
|
|
After fine-tuning, the model and tokenizer were saved for deployment and easy loading in future projects. They can be loaded from the Hugging Face Hub or saved locally for custom applications.
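
For example (the local directory name here is hypothetical):

```python
# Save the fine-tuned model and tokenizer to a local directory...
save_dir = "./smolLM-plagiarism-detector"  # hypothetical path
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# ...and reload them later exactly as you would from the Hub.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```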
|
|
|
## License
|
|
|
This model and the associated code are released under the MIT License, allowing both personal and commercial use.
|
|
|
### Connect with Me
|
|
|
I appreciate your support and am happy to connect!
|
[GitHub](https://github.com/Jatin-Mehra119) | [Email]([email protected]) | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/) |