
DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus

This repository hosts a DistilBERT model fine-tuned for authorship attribution on the Blog Authorship Corpus. Given a blog post, the model predicts which of the corpus's ten most prolific authors wrote it.

Model Details

  • Model Architecture: DistilBERT Base (distilbert-base-uncased)
  • Task: Authorship Attribution
  • Dataset: Blog Authorship Corpus (Top 10 authors selected)
  • Quantization: Post-training (float16 half precision for GPU inference; dynamic int8 for CPU, see below)
  • Fine-tuning Framework: Hugging Face Transformers

Usage

Installation

pip install transformers torch

Loading the Model

from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch

# Load fine-tuned model
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Move the model to GPU if available; half precision is only reliable on CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
if device == "cuda":
    model.half()

# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"

# Tokenize input (token IDs and attention mask are integer tensors,
# so no dtype conversion is needed even when the model runs in half precision)
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J"
}

predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")

Performance Metrics

  • Accuracy: ~78% on the validation split of the top-10-author subset
  • Precision/Recall/F1: vary per class; average F1 around 0.75
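For reference, here is a minimal sketch of how these metrics could be recomputed with scikit-learn (installed separately), assuming val_texts and val_labels hold the held-out validation split; both names are placeholders, not files shipped in this repository:

from sklearn.metrics import classification_report

# Predict an author for every validation post
predictions = []
for text in val_texts:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**enc).logits
    predictions.append(logits.argmax(dim=1).item())

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(val_labels, predictions))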

Fine-Tuning Details

Dataset

The model was fine-tuned on a subset of the Blog Authorship Corpus containing posts from the 10 most prolific authors. Each sample is a blog post paired with its author label.
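A minimal sketch of how such a subset can be built with pandas, assuming the corpus has been loaded into a dataframe df with text and author columns (the column names are an assumption, not the corpus's exact schema):

import pandas as pd

# Keep only the 10 authors with the most posts
top_authors = df["author"].value_counts().head(10).index
subset = df[df["author"].isin(top_authors)].copy()

# Assign each author a stable integer class label
label2id = {author: i for i, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author"].map(label2id)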

Training

  • Epochs: 3
  • Batch size: 8
  • Evaluation strategy: Per epoch
  • Learning rate: 2e-5
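These hyperparameters map directly onto Hugging Face TrainingArguments. A minimal sketch, assuming train_dataset and val_dataset are tokenized splits prepared as above (both placeholders):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",  # named evaluation_strategy on older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized training split
    eval_dataset=val_dataset,     # placeholder: tokenized validation split
)
trainer.train()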

Quantization

Post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate CPU inference:

# Quantize all Linear layers to int8; activations are quantized dynamically at runtime
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
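Note that dynamic quantization targets CPU inference: the quantized Linear layers run in int8 on CPU, so it is applied to the float32 model rather than the half-precision one. A quick, illustrative way to compare on-disk sizes:

import os, tempfile

# Save each state dict to a temporary file and report its size
for name, m in [("fp32", model), ("int8-dynamic", quantized_model)]:
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        torch.save(m.state_dict(), f.name)
    print(name, round(os.path.getsize(f.name) / 1e6, 1), "MB")
    os.remove(f.name)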

Repository Structure

.
├── model/               # Fine-tuned and quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Documentation

Limitations

  • The model is limited to the top 10 authors used in fine-tuning.
  • May not generalize well to unseen authors or blogs outside the dataset distribution.
  • Quantization may slightly reduce prediction accuracy.

Contributing

Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.