DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus
This repository hosts a DistilBERT model fine-tuned for authorship attribution on the Blog Authorship Corpus. Given a blog post, the model predicts which of a small set of top contributors wrote it.
Model Details
- Model Architecture: DistilBERT Base (distilbert-base-uncased)
- Task: Authorship Attribution
- Dataset: Blog Authorship Corpus (Top 10 authors selected)
- Quantization: Post-training (Float16 inference on GPU; dynamic INT8 via PyTorch)
- Fine-tuning Framework: Hugging Face Transformers
Usage
Installation
pip install transformers torch
Loading the Model
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch
# Load fine-tuned model
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
# Move the model to the available device and set evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
# Use half precision only on GPU; fp16 inference is poorly supported on CPU
if device == "cuda":
    model.half()
# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"
# Tokenize input (token IDs and attention mask stay integer-typed, so no dtype conversion is needed)
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
# Label mapping (example)
label_mapping = {
0: "Author_A",
1: "Author_B",
2: "Author_C",
3: "Author_D",
4: "Author_E",
5: "Author_F",
6: "Author_G",
7: "Author_H",
8: "Author_I",
9: "Author_J"
}
predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
Performance Metrics
- Accuracy: ~78% on the validation set (top 10 authors)
- Precision/Recall/F1: vary per class, with an average F1 of roughly 0.75 (an evaluation sketch follows below)
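The figures above can be reproduced with a standard classification report over the held-out posts. A minimal sketch, assuming scikit-learn is installed and that val_texts / val_labels are hypothetical lists holding the validation split:
from sklearn.metrics import accuracy_score, classification_report

# val_texts / val_labels are hypothetical: the validation split is not shipped with this repo
preds = []
for text in val_texts:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**enc).logits
    preds.append(int(torch.argmax(logits, dim=-1)))

print("Accuracy:", accuracy_score(val_labels, preds))
print(classification_report(val_labels, preds, digits=3))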
Fine-Tuning Details
Dataset
The model was fine-tuned on a subset of the Blog Authorship Corpus containing posts from the 10 most prolific authors. Each sample is a single blog post paired with its author label.
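The preprocessing script is not shipped with this repository, but the selection step amounts to counting posts per author and keeping the ten most frequent. A minimal sketch, assuming a hypothetical blogs.csv export with author_id and text columns:
import pandas as pd

# blogs.csv is a hypothetical flat export of the corpus: one row per post (author_id, text)
df = pd.read_csv("blogs.csv")

# Keep the ten most prolific authors and map each one to an integer class label
top_authors = df["author_id"].value_counts().nlargest(10).index
subset = df[df["author_id"].isin(top_authors)].copy()
label2id = {author: idx for idx, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author_id"].map(label2id)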
Training
- Epochs: 3
- Batch size: 8
- Evaluation strategy: Per epoch
- Learning rate: 2e-5 (a Trainer sketch with these settings is shown below)
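A minimal reconstruction of this setup with the Hugging Face Trainer, assuming train_dataset and eval_dataset are hypothetical pre-tokenized splits built from the subset above:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical tokenized training split
    eval_dataset=eval_dataset,    # hypothetical tokenized validation split
)
trainer.train()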
Quantization
Post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate inference:
quantized_model = torch.quantization.quantize_dynamic(
model, {torch.nn.Linear}, dtype=torch.qint8
)
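Dynamic quantization targets CPU inference: Linear weights are stored as INT8 and dequantized on the fly, and it expects the float32 model on CPU as input. One way to gauge the size reduction is to serialize both versions and compare; a rough sketch (file names are illustrative):
import os

# Save both state dicts and compare on-disk size (paths are illustrative)
torch.save(model.state_dict(), "model_fp32.pt")
torch.save(quantized_model.state_dict(), "model_int8.pt")
fp32_mb = os.path.getsize("model_fp32.pt") / 1e6
int8_mb = os.path.getsize("model_int8.pt") / 1e6
print(f"FP32: {fp32_mb:.1f} MB -> dynamic INT8: {int8_mb:.1f} MB")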
Repository Structure
.
├── model/               # Contains the fine-tuned and quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary
├── model.safetensors    # Safetensors version of the model weights
└── README.md            # Documentation
Limitations
- The model only distinguishes between the 10 authors seen during fine-tuning and cannot identify unseen authors.
- It may not generalize well to blogs outside the dataset distribution.
- Quantization may cause a slight drop in prediction accuracy.
Contributing
Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.