
RoBERTa-Base Quantized Model for Toxic Comment Classification

This repository hosts a quantized version of the RoBERTa-base model, fine-tuned for toxic comment classification. The model has been optimized with FP16 quantization for efficient deployment without significant accuracy loss.

Model Details

  • Model Architecture: RoBERTa Base
  • Task: Binary Toxic Comment Classification (Toxic/Non-Toxic)
  • Dataset: Classified_comments
  • Quantization: Float16
  • Fine-tuning Framework: Hugging Face Transformers

Installation

pip install torch transformers datasets scikit-learn

Loading the Model

from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
import re

# Select device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
# (point from_pretrained at this repository's quantized-model/ directory to use the fine-tuned FP16 checkpoint)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today"
]


# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert to list
    if isinstance(texts, str):
        texts = [texts]
    
    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)
        text = re.sub(r'\@\w+|\#','', text)
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move inputs to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    # Run inference without gradient tracking
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map predictions
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]
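
For example, the helper can be run on the sample comments defined above:

predictions = predict_comments(new_comments, model, tokenizer)
for comment, label in zip(new_comments, predictions):
    print(f"{label}: {comment}")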

Performance Metrics

  • Accuracy: 0.979737
  • Precision: 0.976084
  • Recall: 0.984133
  • F1 Score: 0.980092
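
These metrics can be reproduced on the held-out test split with scikit-learn; a minimal sketch, assuming y_true and y_pred are lists of 0/1 labels (0 = Non-Toxic, 1 = Toxic) for the test set and the model's predictions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: ground-truth and predicted labels for the test split (assumed)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.6f}  Precision: {precision:.6f}  Recall: {recall:.6f}  F1: {f1:.6f}")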

Fine-Tuning Details

Dataset

The dataset is sourced from Kaggle (Classified_comment.csv). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
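
A minimal sketch of the re-split, assuming the merged comments live in a CSV with text and label columns (column names are an assumption, not part of this repository):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Classified_comment.csv")  # assumed columns: text, label

# Shuffle and re-split the merged data into 80% train / 20% test, stratified on the label
train_df, test_df = train_test_split(
    df, test_size=0.2, shuffle=True, stratify=df["label"], random_state=42
)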

Training

  • Epochs: 3
  • Batch size: 8
  • Learning rate: 2e-5
  • Evaluation strategy: epoch
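
The hyperparameters above correspond roughly to the following Trainer setup (a minimal sketch; the tokenized train_dataset and eval_dataset objects are assumed and not part of this repository):

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized training split (assumed)
    eval_dataset=eval_dataset,    # tokenized evaluation split (assumed)
)
trainer.train()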

Quantization

Post-training quantization was applied using PyTorch’s half() precision (FP16) to reduce model size and inference time.
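
A minimal sketch of the FP16 conversion, assuming model and tokenizer are the fine-tuned objects loaded above:

# Convert all floating-point weights to half precision (FP16)
model = model.half()

# Save the quantized model and tokenizer
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")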


Repository Structure

.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── README.md                      # Model documentation

Limitations

  • The model is trained specifically for binary toxic comment classification (Toxic/Non-Toxic) and may not generalize to other text classification tasks.
  • FP16 quantization may result in slight numerical instability in edge cases.

Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.
