
RoBERTa-Base Quantized Model for Toxic Comment Classification

This repository hosts a quantized version of the RoBERTa-base model, fine-tuned for toxic comment classification. The model has been optimized with FP16 quantization for efficient deployment without significant accuracy loss.

Model Details

  • Model Architecture: RoBERTa Base
  • Task: Binary Toxic Comment Classification (Toxic/Non-Toxic)
  • Dataset: Classified_comments
  • Quantization: Float16
  • Fine-tuning Framework: Hugging Face Transformers

Installation

pip install torch transformers datasets scikit-learn

Loading the Model

from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
import re

# Select device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
# (point from_pretrained at this repository's quantized-model/ directory to use the fine-tuned FP16 checkpoint)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today"
]


# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert to list
    if isinstance(texts, str):
        texts = [texts]
    
    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)
        text = re.sub(r'\@\w+|\#','', text)
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move inputs to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    # Run inference without gradient tracking
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map predictions
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]
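
For example, the helper can be run on the sample comments defined above:

predictions = predict_comments(new_comments, model, tokenizer)
for comment, label in zip(new_comments, predictions):
    print(f"{label}: {comment}")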

Performance Metrics

  • Accuracy: 0.979737
  • Precision: 0.976084
  • Recall: 0.984133
  • F1 Score: 0.980092
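
These metrics can be reproduced on the held-out test split with scikit-learn; a minimal sketch, assuming y_true and y_pred are lists of 0/1 labels (0 = Non-Toxic, 1 = Toxic) for the test set and the model's predictions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: ground-truth and predicted labels for the test split (assumed)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.6f}  Precision: {precision:.6f}  Recall: {recall:.6f}  F1: {f1:.6f}")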

Fine-Tuning Details

Dataset

The dataset is sourced from Kaggle (Classified_comment.csv). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
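
A minimal sketch of the re-split, assuming the merged comments live in a CSV with text and label columns (column names are an assumption, not part of this repository):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Classified_comment.csv")  # assumed columns: text, label

# Shuffle and re-split the merged data into 80% train / 20% test, stratified on the label
train_df, test_df = train_test_split(
    df, test_size=0.2, shuffle=True, stratify=df["label"], random_state=42
)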

Training

  • Epochs: 3
  • Batch size: 8
  • Learning rate: 2e-5
  • Evaluation strategy: epoch
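
The hyperparameters above correspond roughly to the following Trainer setup (a minimal sketch; the tokenized train_dataset and eval_dataset objects are assumed and not part of this repository):

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized training split (assumed)
    eval_dataset=eval_dataset,    # tokenized evaluation split (assumed)
)
trainer.train()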

Quantization

Post-training quantization was applied using PyTorch’s half() precision (FP16) to reduce model size and inference time.
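
A minimal sketch of the FP16 conversion, assuming model and tokenizer are the fine-tuned objects loaded above:

# Convert all floating-point weights to half precision (FP16)
model = model.half()

# Save the quantized model and tokenizer
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")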


Repository Structure

.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── README.md                      # Model documentation

Limitations

  • The model is trained specifically for binary toxic comment classification (Toxic/Non-Toxic) and may not generalize to other text classification tasks.
  • FP16 quantization may result in slight numerical instability in edge cases.

Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.
