RoBERTa-Base Quantized Model for Toxic Comment Classification
This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic-comment classification. The model has been converted to FP16 (half precision) for efficient deployment without significant accuracy loss.
Model Details
- Model Architecture: RoBERTa Base
- Task: Binary Toxic-Comment Classification (Toxic/Non-Toxic)
- Dataset: Classified_comments
- Quantization: Float16
- Fine-tuning Framework: Hugging Face Transformers
Installation
pip install torch transformers datasets scikit-learn
Loading the Model
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Select the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
# Note: "roberta-base" is the base checkpoint, so its classification head is
# untrained. Point both calls at this repository's fine-tuned weights instead.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today"
]

# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert to list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # strip URLs
        text = re.sub(r'\@\w+|\#', '', text)  # strip @mentions and '#' signs
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)  # drop other symbols
        text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move the inputs to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map predictions
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]

# Classify the test sentences
print(predict_comments(new_comments, model, tokenizer))
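To see what the preprocessing step does before tokenization, the sketch below copies the same regular expressions into a standalone function and runs them on a made-up comment. The sample text is only an illustration and is not taken from the dataset.

```python
import re

def preprocess(text: str) -> str:
    """Standalone copy of the cleaning steps used in predict_comments."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # strip URLs
    text = re.sub(r"\@\w+|\#", "", text)                 # strip @mentions and '#'
    text = re.sub(r"[^a-zA-Z0-9\s.,!?']", "", text)      # drop other symbols
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    return text

# Hypothetical input, chosen to exercise every rule
sample = "Check THIS out @user!! https://example.com #wow 😀"
cleaned = preprocess(sample)
print(cleaned)  # check this out !! wow
```

Note that the hashtag word itself is kept; only the `#` character is removed.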
Performance Metrics
- Accuracy: 0.979737
- Precision: 0.976084
- Recall: 0.984133
- F1 Score: 0.980092
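As a quick consistency check, the F1 score can be recomputed from the reported precision and recall as their harmonic mean. This only uses the numbers listed above and does not rerun any evaluation.

```python
# Values copied from the metrics listed above
precision = 0.976084
recall = 0.984133

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.980092
```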
Fine-Tuning Details
Dataset
The dataset is sourced from Kaggle (Classified_comment.csv). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
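A minimal sketch of the merge, shuffle, and re-split procedure is shown below. The function name, the seed, and the toy rows are assumptions made for illustration; the original split code is not part of this card and may have used a different tool, such as scikit-learn.

```python
import random

def merge_shuffle_split(train_rows, test_rows, train_percent=80, seed=42):
    """Merge two splits, shuffle them together, and re-split by percentage.

    Hypothetical helper: name, seed, and signature are illustrative only.
    """
    merged = list(train_rows) + list(test_rows)
    random.Random(seed).shuffle(merged)        # seeded for reproducibility
    cut = len(merged) * train_percent // 100   # integer math avoids float rounding
    return merged[:cut], merged[cut:]

# Toy stand-in for the real data: (comment, label) pairs
original_train = [(f"comment {i}", i % 2) for i in range(70)]
original_test = [(f"comment {i}", i % 2) for i in range(70, 100)]

new_train, new_test = merge_shuffle_split(original_train, original_test)
print(len(new_train), len(new_test))  # 80 20
```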
Training
- Epochs: 3
- Batch size: 8
- Learning rate: 2e-5
- Evaluation strategy: epoch
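The hyperparameters above imply a rough training length, sketched below. The calculation assumes that the 80% split of the 140,000 comments is used for training, on a single device, without gradient accumulation. None of these assumptions is confirmed by this card.

```python
# Figures from this card
total_comments = 140_000
batch_size = 8
epochs = 3

# Assumption: the 80% portion of the re-split data is the training set
train_examples = total_comments * 80 // 100

# Ceiling division: a final partial batch still counts as one step
steps_per_epoch = (train_examples + batch_size - 1) // batch_size
total_steps = steps_per_epoch * epochs

# With evaluation strategy "epoch", evaluation runs once per epoch
evaluations = epochs

print(train_examples, steps_per_epoch, total_steps, evaluations)
# 112000 14000 42000 3
```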
Quantization
After training, the model weights were converted to half precision (FP16) using PyTorch's half() method to reduce model size and inference time.
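The effect of FP16 storage on a single value can be illustrated without PyTorch. The sketch below uses the standard library's `struct` module, whose `e` format is IEEE 754 half precision, to compare storage size and rounding error. It is an illustration of the number format only, not the conversion code used for this model, and the weight value is made up.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Bytes needed to store one value in each format
bytes_fp32 = struct.calcsize("<f")
bytes_fp16 = struct.calcsize("<e")
print(bytes_fp32, bytes_fp16)  # 4 2

# Hypothetical weight value: FP16 keeps roughly three decimal digits
weight = 0.1
rounded = to_fp16(weight)
print(rounded, abs(rounded - weight))
```

Halving the bytes per parameter is what roughly halves the size of the saved weights.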
Repository Structure
.
├── quantized-model/            # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                   # Model documentation
Limitations
- The model is trained specifically for binary classification of comments as Toxic or Non-Toxic.
- FP16 quantization may result in slight numerical instability in edge cases.
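The edge cases mentioned above usually come from FP16's narrow range: the largest finite value is 65504, and magnitudes far below about 6e-8 round to zero. The sketch below demonstrates both limits with the standard library's `struct` module. The helper names are hypothetical, and the behavior differs slightly from PyTorch, where an out-of-range value becomes `inf` instead of raising an error.

```python
import struct

FP16_MAX = 65504.0  # largest finite half-precision value

def fits_in_fp16(value: float) -> bool:
    """Return True if value can be packed as a finite FP16 number."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        # struct raises here; a PyTorch FP16 tensor would hold inf instead
        return False

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

print(fits_in_fp16(FP16_MAX))   # True
print(fits_in_fp16(70000.0))    # False: overflow
print(to_fp16(1e-8))            # 0.0: underflow to zero
```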
Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.