initial

Files changed (12)

README.md +69 -0
app.py +108 -0
model/added_tokens.json +3 -0
model/config.json +31 -0
model/merges.txt +0 -0
model/model.safetensors +3 -0
model/special_tokens_map.json +49 -0
model/tokenizer.json +0 -0
model/tokenizer_config.json +176 -0
model/vocab.json +0 -0
plagairism-fine-tuning using LLM.ipynb +0 -0
test-model.ipynb +329 -0

README.md CHANGED

@@ -1,3 +1,72 @@
 ---
 license: mit
 ---

 ---
 license: mit
+datasets:
+- nvidia/HelpSteer2
+language:
+- en
+metrics:
+- accuracy
+- f1
+- recall
+base_model:
+- HuggingFaceTB/SmolLM2-135M-Instruct
+new_version: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM-Detection
+pipeline_tag: text-classification
+library_name: transformers
+tags:
+- legal
+- plagiarism-detection
 ---
+# SmolLM Fine-Tuned for Plagiarism Detection
+This repository hosts a fine-tuned version of SmolLM (135M parameters) that detects plagiarism by classifying sentence pairs as plagiarized or non-plagiarized. Fine-tuning was performed on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to improve the model’s ability to recognize rephrased text.
+## Model Information
+-   **Base Model**: HuggingFaceTB/SmolLM2-135M-Instruct
+-   **Fine-tuned Model Name**: `jatinmehra/smolLM-fine-tuned-for-plagiarism-detection`
+-   **License**: MIT
+-   **Language**: English
+-   **Task**: Text Classification
+-   **Metrics**: Accuracy, F1 Score, Recall
+## Dataset
+The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides sentence pairs labeled to indicate whether one sentence is a rephrased version of the other (i.e., plagiarized). The dataset is suited to sentence-level similarity detection, and its labels (`1` for plagiarized, `0` for non-plagiarized) map directly onto binary classification.
+## Training Procedure
+The fine-tuning was done using the `transformers` library from Hugging Face. Key details include:
+-   **Model Architecture**: The model was modified for sequence classification with two output labels.
+-   **Optimizer**: AdamW with a learning rate of 2e-5.
+-   **Loss Function**: Cross-Entropy Loss was used as the objective function.
+-   **Batch Size**: 16, chosen to balance memory use and throughput.
+-   **Epochs**: Trained for 3 epochs.
+-   **Padding**: A custom `[PAD]` token (ID 49152) was added to the tokenizer and the embedding matrix was resized to 49,153 entries, so that batches can be padded to a common length.
+Training used a DataLoader that fed tokenized sentence pairs into the model; each pair was truncated or padded to a fixed length and accompanied by an attention mask. After training, the model reached about 99.66% accuracy on the training set.
+## Usage
+The model can be used directly with the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Load the model and tokenizer from the `jatinmehra/smolLM-fine-tuned-for-plagiarism-detection` repository and pass sentence pairs as input. The predicted class is the index of the larger of the two output logits: `1` for plagiarized, `0` for non-plagiarized.
+## Evaluation
+Evaluation results:
+-   **Accuracy**: approximately **99.66%** on the training set; **100%** on the 473-pair test set (see `test-model.ipynb`)
+-   **Other metrics** (test set): F1 **1.0**, recall **1.0**
+## Model and Tokenizer Saving
+Upon completion of fine-tuning, the model and tokenizer were saved for deployment and ease of loading in future projects. They can be loaded from Hugging Face or saved locally for custom applications.
+## License
+This model and associated code are released under the MIT License, allowing for both personal and commercial use.
+### Connect with Me
+I appreciate your support and am happy to connect!
+[GitHub](https://github.com/Jatin-Mehra119) | [Email]([email protected]) | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/)
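
The Usage section of the README says the output logits can be interpreted to decide whether plagiarism is detected. A minimal plain-Python sketch of that step follows. It assumes the model returns two logits per pair in the order [non-plagiarized, plagiarized], which matches the `0`/`1` labels described in the README; the function names and the example logit values are hypothetical.

```python
import math

LABELS = {0: "non-plagiarized", 1: "plagiarized"}  # label mapping from the README

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpret_logits(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical logits for one sentence pair
label, confidence = interpret_logits([-1.2, 3.4])
print(label, round(confidence, 3))
```

Taking the argmax alone, as `app.py` does, gives the same label; the softmax step only adds a confidence value that could be thresholded.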

app.py ADDED

	@@ -0,0 +1,108 @@

+import streamlit as st
+import torch
+from transformers import GPT2Tokenizer, LlamaForSequenceClassification
+import fitz  # PyMuPDF for extracting text from PDFs
+import io
+from torch.utils.data import Dataset
+# Load the tokenizer and model
+model_path = "model"
+tokenizer = GPT2Tokenizer.from_pretrained(model_path, local_files_only=True)
+model = LlamaForSequenceClassification.from_pretrained(model_path, local_files_only=True)
+model.eval()
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = model.to(device)
+# Function to extract text from a PDF
+def extract_text_from_pdf(pdf_file):
+    # Read the uploaded PDF as bytes and wrap them in a file-like object
+    pdf_stream = io.BytesIO(pdf_file.read())
+    # Open the PDF with PyMuPDF; the context manager closes the document afterwards
+    with fitz.open(stream=pdf_stream, filetype="pdf") as doc:
+        # Concatenate the plain text of every page
+        return "".join(page.get_text("text") for page in doc)
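+# Tokenize a pair of texts for the model. The pair is truncated to 128 tokens in total,
+# so for long documents only the beginning of each text reaches the model.
+def preprocess_text(text1, text2):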
+    inputs = tokenizer(
+        text1, text2,
+        add_special_tokens=True,
+        max_length=128,
+        padding='max_length',
+        truncation=True,
+        return_tensors="pt"
+    )
+    return inputs
+# Dataset wrapping two parallel lists of texts; each item is one tokenized pair
+class PlagiarismDataset(Dataset):
+    def __init__(self, text1, text2, tokenizer):
+        self.text1 = text1
+        self.text2 = text2
+        self.tokenizer = tokenizer
+    def __len__(self):
+        return len(self.text1)
+    def __getitem__(self, idx):
+        inputs = preprocess_text(self.text1[idx], self.text2[idx])
+        return {
+            'input_ids': inputs['input_ids'].squeeze(0),
+            'attention_mask': inputs['attention_mask'].squeeze(0)
+        }
+# Function to detect plagiarism using the model
+def detect_plagiarism(text1, text2):
+    dataset = PlagiarismDataset(text1, text2, tokenizer)
+    data_loader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False)
+    predictions = []
+    with torch.no_grad():
+        for batch in data_loader:
+            input_ids = batch['input_ids'].to(device)
+            attention_mask = batch['attention_mask'].to(device)
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
+            preds = torch.argmax(outputs.logits, dim=1)
+            predictions.append(preds.item())
+    return predictions[0]
+# Streamlit UI
+st.title("Plagiarism Detection using LLM")
+st.write("Upload two PDFs for plagiarism detection.")
+# Upload PDFs
+pdf_file1 = st.file_uploader("Upload the first PDF", type="pdf")
+pdf_file2 = st.file_uploader("Upload the second PDF", type="pdf")
+if pdf_file1 and pdf_file2:
+    # Extract text from PDFs
+    text1 = extract_text_from_pdf(pdf_file1)
+    text2 = extract_text_from_pdf(pdf_file2)
+    # Display some text from the PDFs for context
+    st.subheader("Text from the first document:")
+    st.text(text1[:1000])  # Display the first 1000 characters of the document
+    st.subheader("Text from the second document:")
+    st.text(text2[:1000])
+    # Detect plagiarism
+    result = detect_plagiarism([text1], [text2])
+    # Display the result
+    if result == 1:
+        st.warning("Plagiarism detected!")
+    else:
+        st.success("No plagiarism detected.")
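
Because `preprocess_text` truncates each pair to 128 tokens, the app compares only the beginning of the two PDFs. One possible way to cover whole documents, sketched here under assumptions, is to split each text into word windows and classify window pairs. `chunk_words`, `plagiarism_ratio`, the window size, and the `classify_pair` callable are all hypothetical; in the app, `classify_pair` could wrap `detect_plagiarism([a], [b])`.

```python
from typing import Callable, List

def chunk_words(text: str, size: int = 40) -> List[str]:
    """Split text into consecutive windows of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def plagiarism_ratio(text1: str, text2: str,
                     classify_pair: Callable[[str, str], int],
                     size: int = 40) -> float:
    """Fraction of windows in text1 flagged against at least one window of text2."""
    chunks1 = chunk_words(text1, size)
    chunks2 = chunk_words(text2, size)
    if not chunks1 or not chunks2:
        return 0.0
    flagged = sum(
        1 for a in chunks1 if any(classify_pair(a, b) == 1 for b in chunks2)
    )
    return flagged / len(chunks1)

# Stand-in classifier for demonstration only: flags identical windows
def exact_match(a: str, b: str) -> int:
    return int(a == b)

ratio = plagiarism_ratio("one two three four", "one two nine ten", exact_match, size=2)
print(ratio)
```

The number of model calls grows with the product of the two window counts, so a real deployment would likely need batching or a cheaper pre-filter.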

model/added_tokens.json ADDED

	@@ -0,0 +1,3 @@

+{
+  "[PAD]": 49152
+}

model/config.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "_name_or_path": "HuggingFaceTB/SmolLM-135M",
+  "architectures": [
+    "LlamaForSequenceClassification"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 0,
+  "eos_token_id": 0,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 576,
+  "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "max_position_embeddings": 2048,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 9,
+  "num_hidden_layers": 30,
+  "num_key_value_heads": 3,
+  "pad_token_id": 49152,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 10000.0,
+  "tie_word_embeddings": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.45.1",
+  "use_cache": true,
+  "vocab_size": 49153
+}

model/merges.txt ADDED

The diff for this file is too large to render. See raw diff

model/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:037f0e1d8903ff226c57c41d8419a5fa9648f7d50e9d093d5bc571139762b30e
+size 538097400
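
As a rough sanity check, the checkpoint size can be compared with a parameter count derived from `model/config.json`. The sketch below assumes the layer layout printed in `test-model.ipynb` (bias-free linear layers, two RMSNorm weights per decoder layer, a final norm, and a two-way `score` head) and 4 bytes per `float32` parameter. The helper name is hypothetical.

```python
def count_parameters(vocab_size, hidden, intermediate, layers,
                     heads, kv_heads, head_dim, num_labels):
    """Count the weights of a bias-free Llama-style classifier from config values."""
    embed = vocab_size * hidden
    q_and_o = 2 * hidden * heads * head_dim      # q_proj and o_proj
    k_and_v = 2 * hidden * kv_heads * head_dim   # k_proj and v_proj
    mlp = 3 * hidden * intermediate              # gate, up and down projections
    norms = 2 * hidden                           # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    final_norm = hidden
    score = hidden * num_labels                  # classification head
    return embed + layers * per_layer + final_norm + score

# Values taken from model/config.json in this commit
params = count_parameters(vocab_size=49153, hidden=576, intermediate=1536,
                          layers=30, heads=9, kv_heads=3, head_dim=64,
                          num_labels=2)
print(params)      # 134516736
print(params * 4)  # 538066944 bytes of float32 weights
```

The Git LFS pointer reports 538,097,400 bytes, about 30 KB more than this estimate. That gap is plausibly the safetensors header, though this is not verified here.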

model/special_tokens_map.json ADDED

	@@ -0,0 +1,49 @@

+{
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<|im_start|>",
+    "<|im_end|>",
+    "<repo_name>",
+    "<reponame>",
+    "<file_sep>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<jupyter_script>",
+    "<empty_output>"
+  ],
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

model/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

model/tokenizer_config.json ADDED

	@@ -0,0 +1,176 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<repo_name>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<jupyter_script>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "49152": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<|im_start|>",
+    "<|im_end|>",
+    "<repo_name>",
+    "<reponame>",
+    "<file_sep>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<jupyter_script>",
+    "<empty_output>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "[PAD]",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
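
The tokenizer and model files have to agree on the padding token: the tokenizer reports a base vocabulary of 49,152 entries, `[PAD]` is appended as ID 49152, and the model's embedding table therefore needs 49,153 rows. A small check of that relationship follows, with the relevant values copied from the JSON files in this commit. The function name is hypothetical.

```python
def check_pad_token(tokenizer_vocab_size, added_tokens, model_config):
    """Return a list of inconsistencies between tokenizer files and model config."""
    problems = []
    pad_id = added_tokens.get("[PAD]")
    if pad_id is None:
        problems.append("tokenizer has no [PAD] token")
        return problems
    if pad_id != model_config["pad_token_id"]:
        problems.append("pad_token_id differs between tokenizer and model")
    expected_vocab = tokenizer_vocab_size + len(added_tokens)
    if model_config["vocab_size"] != expected_vocab:
        problems.append("model vocab_size does not cover the added tokens")
    if pad_id >= model_config["vocab_size"]:
        problems.append("pad token ID is outside the embedding table")
    return problems

added_tokens = {"[PAD]": 49152}                               # model/added_tokens.json
model_config = {"pad_token_id": 49152, "vocab_size": 49153}   # model/config.json
problems = check_pad_token(49152, added_tokens, model_config)
print(problems)  # []
```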

model/vocab.json ADDED

The diff for this file is too large to render. See raw diff

plagairism-fine-tuning using LLM.ipynb ADDED

The diff for this file is too large to render. See raw diff

test-model.ipynb ADDED

	@@ -0,0 +1,329 @@

+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "LlamaForSequenceClassification(\n",
+       "  (model): LlamaModel(\n",
+       "    (embed_tokens): Embedding(49153, 576, padding_idx=49152)\n",
+       "    (layers): ModuleList(\n",
+       "      (0-29): 30 x LlamaDecoderLayer(\n",
+       "        (self_attn): LlamaSdpaAttention(\n",
+       "          (q_proj): Linear(in_features=576, out_features=576, bias=False)\n",
+       "          (k_proj): Linear(in_features=576, out_features=192, bias=False)\n",
+       "          (v_proj): Linear(in_features=576, out_features=192, bias=False)\n",
+       "          (o_proj): Linear(in_features=576, out_features=576, bias=False)\n",
+       "          (rotary_emb): LlamaRotaryEmbedding()\n",
+       "        )\n",
+       "        (mlp): LlamaMLP(\n",
+       "          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)\n",
+       "          (up_proj): Linear(in_features=576, out_features=1536, bias=False)\n",
+       "          (down_proj): Linear(in_features=1536, out_features=576, bias=False)\n",
+       "          (act_fn): SiLU()\n",
+       "        )\n",
+       "        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)\n",
+       "        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)\n",
+       "      )\n",
+       "    )\n",
+       "    (norm): LlamaRMSNorm((576,), eps=1e-05)\n",
+       "    (rotary_emb): LlamaRotaryEmbedding()\n",
+       "  )\n",
+       "  (score): Linear(in_features=576, out_features=2, bias=False)\n",
+       ")"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from transformers import GPT2Tokenizer, LlamaForSequenceClassification\n",
+    "\n",
+    "# Load the GPT2 tokenizer and Llama model for sequence classification\n",
+    "model_path = r\"C:\\Users\\jatin\\OneDrive\\Desktop\\plagiarism-detection\\smolLM-fined-tuned-for-PLAGAIRISM-Detection\\model\"\n",
+    "tokenizer = GPT2Tokenizer.from_pretrained(model_path, local_files_only=True)\n",
+    "model = LlamaForSequenceClassification.from_pretrained(model_path, local_files_only=True)\n",
+    "\n",
+    "# Set model to evaluation mode\n",
+    "model.eval()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>sentence1</th>\n",
+       "      <th>sentence2</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>A person on a horse jumps over a broken down a...</td>\n",
+       "      <td>A person is at a diner, ordering an omelette.</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>A person on a horse jumps over a broken down a...</td>\n",
+       "      <td>A person is outdoors, on a horse.</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Children smiling and waving at camera</td>\n",
+       "      <td>There are children present</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Children smiling and waving at camera</td>\n",
+       "      <td>The kids are frowning</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>A boy is jumping on skateboard in the middle o...</td>\n",
+       "      <td>The boy skates down the sidewalk.</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                           sentence1  \\\n",
+       "0  A person on a horse jumps over a broken down a...   \n",
+       "1  A person on a horse jumps over a broken down a...   \n",
+       "2              Children smiling and waving at camera   \n",
+       "3              Children smiling and waving at camera   \n",
+       "4  A boy is jumping on skateboard in the middle o...   \n",
+       "\n",
+       "                                       sentence2  label  \n",
+       "0  A person is at a diner, ordering an omelette.      0  \n",
+       "1              A person is outdoors, on a horse.      1  \n",
+       "2                     There are children present      1  \n",
+       "3                          The kids are frowning      0  \n",
+       "4              The boy skates down the sidewalk.      0  "
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import torch\n",
+    "import pandas as pd\n",
+    "\n",
+    "df = pd.read_csv(\"train_snli.txt\", delimiter='\\t', header=None, names=['sentence1', 'sentence2', 'label'])\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "\n",
+    "class PlagiarismDataset(Dataset):\n",
+    "    def __init__(self, df, tokenizer, max_length=128):\n",
+    "        self.df = df\n",
+    "        self.tokenizer = tokenizer\n",
+    "        self.max_length = max_length\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.df)\n",
+    "\n",
+    "    def __getitem__(self, index):\n",
+    "        row = self.df.iloc[index]\n",
+    "\n",
+    "        # Ensure the sentences are strings; convert or skip if not\n",
+    "        sentence1 = str(row['sentence1']) if not pd.isna(row['sentence1']) else \"\"\n",
+    "        sentence2 = str(row['sentence2']) if not pd.isna(row['sentence2']) else \"\"\n",
+    "\n",
+    "        inputs = self.tokenizer(\n",
+    "            sentence1, sentence2,\n",
+    "            add_special_tokens=True,\n",
+    "            max_length=self.max_length,\n",
+    "            padding='max_length',\n",
+    "            truncation=True,\n",
+    "            return_tensors=\"pt\"\n",
+    "        )\n",
+    "\n",
+    "        label = torch.tensor(row['label'], dtype=torch.long)\n",
+    "\n",
+    "        return {\n",
+    "            'input_ids': inputs['input_ids'].squeeze(0),\n",
+    "            'attention_mask': inputs['attention_mask'].squeeze(0),\n",
+    "            'label': label\n",
+    "        }\n",
+    "\n",
+    "def collate_fn(batch):\n",
+    "    input_ids = torch.stack([item['input_ids'] for item in batch])\n",
+    "    attention_masks = torch.stack([item['attention_mask'] for item in batch])\n",
+    "    labels = torch.stack([item['label'] for item in batch])\n",
+    "\n",
+    "    return {\n",
+    "        'input_ids': input_ids,\n",
+    "        'attention_mask': attention_masks,\n",
+    "        'label': labels\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "device(type='cuda')"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "device"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Use the rows from index 366,900 onward as the test set\n",
+    "df_test = df[366_900:]\n",
+    "# Add the padding token if it is not already defined\n",
+    "tokenizer.add_special_tokens({'pad_token': '[PAD]'})\n",
+    "\n",
+    "# Resize the model's token embeddings to fit the new tokenizer\n",
+    "model.resize_token_embeddings(len(tokenizer))\n",
+    "\n",
+    "# Create DataLoader for the test set\n",
+    "test_dataset = PlagiarismDataset(df_test, tokenizer)\n",
+    "test_data_loader = DataLoader(test_dataset, batch_size=16, shuffle=False, collate_fn=collate_fn)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Classification Report:\n",
+      "               precision    recall  f1-score   support\n",
+      "\n",
+      "           0       1.00      1.00      1.00       236\n",
+      "           1       1.00      1.00      1.00       237\n",
+      "\n",
+      "    accuracy                           1.00       473\n",
+      "   macro avg       1.00      1.00      1.00       473\n",
+      "weighted avg       1.00      1.00      1.00       473\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.metrics import classification_report\n",
+    "\n",
+    "# Set up device\n",
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "\n",
+    "# Move model to the appropriate device\n",
+    "model = model.to(device)\n",
+    "\n",
+    "# Function to evaluate the model\n",
+    "def evaluate_model(model, data_loader):\n",
+    "    model.eval()  # Set model to evaluation mode\n",
+    "    preds_list = []\n",
+    "    labels_list = []\n",
+    "\n",
+    "    with torch.no_grad():  # Disable gradient calculation for evaluation\n",
+    "        for batch in data_loader:\n",
+    "            # Move input tensors to the same device as the model\n",
+    "            input_ids = batch['input_ids'].to(device)\n",
+    "            attention_mask = batch['attention_mask'].to(device)\n",
+    "            labels = batch['label'].to(device)\n",
+    "            \n",
+    "            # Get model outputs\n",
+    "            outputs = model(input_ids=input_ids, attention_mask=attention_mask)\n",
+    "            preds = torch.argmax(outputs.logits, dim=1)\n",
+    "\n",
+    "            # Append predictions and true labels to respective lists\n",
+    "            preds_list.extend(preds.cpu().numpy())\n",
+    "            labels_list.extend(labels.cpu().numpy())\n",
+    "    \n",
+    "    # Compute evaluation metrics\n",
+    "    report = classification_report(labels_list, preds_list)\n",
+    "    print(\"Classification Report:\\n\", report)\n",
+    "\n",
+    "# Evaluate the model\n",
+    "evaluate_model(model, test_data_loader)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "LLM",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.20"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
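
The classification report in `test-model.ipynb` summarizes precision, recall, and F1 per class. For reference, these quantities can be computed by hand for the positive class as shown here. This is a plain-Python sketch with made-up labels, not the scikit-learn implementation the notebook uses, and the function name is hypothetical.

```python
def binary_metrics(labels, preds, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up example: one plagiarized pair is missed by the classifier
labels = [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0]
metrics = binary_metrics(labels, preds)
print(metrics)
```

A report of 1.00 across the board, as in the notebook output, corresponds to zero false positives and zero false negatives on the 473 evaluated pairs.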