
DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus

This repository hosts a DistilBERT model fine-tuned for authorship attribution on the Blog Authorship Corpus. Given a blog post, the model predicts which of the corpus's ten most prolific authors wrote it.

Model Details

  • Model Architecture: DistilBERT Base (distilbert-base-uncased)
  • Task: Authorship Attribution
  • Dataset: Blog Authorship Corpus (Top 10 authors selected)
  • Quantization: Post-training (float16 half precision for GPU inference; dynamic int8 for CPU, see below)
  • Fine-tuning Framework: Hugging Face Transformers

Usage

Installation

pip install transformers torch

Loading the Model

from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch

# Load fine-tuned model
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Move the model to GPU if available; half precision is only reliable on CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
if device == "cuda":
    model.half()

# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"

# Tokenize input (token IDs and attention mask are integer tensors,
# so no dtype conversion is needed even when the model runs in half precision)
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J"
}

predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")

Performance Metrics

  • Accuracy: ~78% on the validation split of the top-10-author subset
  • Precision/Recall/F1: vary per class; average F1 around 0.75
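For reference, here is a minimal sketch of how these metrics could be recomputed with scikit-learn (installed separately), assuming val_texts and val_labels hold the held-out validation split; both names are placeholders, not files shipped in this repository:

from sklearn.metrics import classification_report

# Predict an author for every validation post
predictions = []
for text in val_texts:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**enc).logits
    predictions.append(logits.argmax(dim=1).item())

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(val_labels, predictions))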

Fine-Tuning Details

Dataset

The model was fine-tuned on a subset of the Blog Authorship Corpus containing posts from the 10 most prolific authors. Each sample is a blog post paired with its author label.
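A minimal sketch of how such a subset can be built with pandas, assuming the corpus has been loaded into a dataframe df with text and author columns (the column names are an assumption, not the corpus's exact schema):

import pandas as pd

# Keep only the 10 authors with the most posts
top_authors = df["author"].value_counts().head(10).index
subset = df[df["author"].isin(top_authors)].copy()

# Assign each author a stable integer class label
label2id = {author: i for i, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author"].map(label2id)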

Training

  • Epochs: 3
  • Batch size: 8
  • Evaluation strategy: Per epoch
  • Learning rate: 2e-5
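These hyperparameters map directly onto Hugging Face TrainingArguments. A minimal sketch, assuming train_dataset and val_dataset are tokenized splits prepared as above (both placeholders):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",  # named evaluation_strategy on older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized training split
    eval_dataset=val_dataset,     # placeholder: tokenized validation split
)
trainer.train()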

Quantization

Post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate CPU inference:

# Quantize all Linear layers to int8; activations are quantized dynamically at runtime
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
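Note that dynamic quantization targets CPU inference: the quantized Linear layers run in int8 on CPU, so it is applied to the float32 model rather than the half-precision one. A quick, illustrative way to compare on-disk sizes:

import os, tempfile

# Save each state dict to a temporary file and report its size
for name, m in [("fp32", model), ("int8-dynamic", quantized_model)]:
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        torch.save(m.state_dict(), f.name)
    print(name, round(os.path.getsize(f.name) / 1e6, 1), "MB")
    os.remove(f.name)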

Repository Structure

.
├── model/               # Fine-tuned and quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Documentation

Limitations

  • The model is limited to the top 10 authors used in fine-tuning.
  • May not generalize well to unseen authors or blogs outside the dataset distribution.
  • Quantization may slightly reduce prediction accuracy.

Contributing

Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.