|
|
|
# DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus |
|
|
|
This repository hosts a fine-tuned DistilBERT model for the **authorship attribution** task on the Blog Authorship Corpus dataset. Given a blog post, the model predicts which of a fixed set of top contributors wrote it.
|
|
|
## Model Details |
|
|
|
- **Model Architecture:** DistilBERT Base (`distilbert-base-uncased`)
|
- **Task:** Authorship Attribution |
|
- **Dataset:** Blog Authorship Corpus (Top 10 authors selected) |
|
- **Quantization:** Float16 (Post-training) |
|
- **Fine-tuning Framework:** Hugging Face Transformers |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```sh
pip install transformers torch
```
|
|
|
### Loading the Model |
|
|
|
```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch

# Load the fine-tuned model and tokenizer
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Move to GPU if available and switch to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Half precision is only reliably supported on GPU; stay in float32 on CPU
if device == "cuda":
    model.half()

# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"

# Tokenize input (token IDs and attention masks are integer tensors,
# so they need no dtype conversion for a half-precision model)
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J",
}

predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
```
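
If the fine-tuned checkpoint stores its label names, the hand-written mapping above can be replaced by the `id2label` table that `transformers` keeps in the model configuration (this assumes `id2label` was set before the checkpoint was saved):

```python
# Use the label names stored in the checkpoint configuration, if present
predicted_author = model.config.id2label[predicted_class]
print(f"Predicted Author: {predicted_author}")
```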
|
|
|
## Performance Metrics |
|
|
|
- **Accuracy:** ~78% (on a validation set of the top 10 authors)

- **Precision/Recall/F1:** Vary by class; average F1 is approximately 0.75
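
For reference, these aggregate metrics can be recomputed from held-out predictions with scikit-learn. This is a sketch; the `y_true`/`y_pred` values below are placeholders, not outputs of the repository's evaluation script:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder predictions: in practice these come from running the model
# over the validation split (integer label IDs 0-9 for the ten authors)
y_true = [0, 1, 2, 2, 3]
y_pred = [0, 1, 2, 3, 3]

accuracy = accuracy_score(y_true, y_pred)
average_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Accuracy: {accuracy:.3f}, Average F1: {average_f1:.3f}")
```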
|
|
|
## Fine-Tuning Details |
|
|
|
### Dataset |
|
|
|
The model is trained on a subset of the **Blog Authorship Corpus** containing blogs from the top 10 most prolific authors. Each sample is a blog post with its corresponding author label. |
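
As a rough illustration of how such a subset can be built, the sketch below assumes the corpus is available as a CSV with `text` and `author` columns; the file name and column names are assumptions, not the repository's actual preprocessing script:

```python
import pandas as pd

# Load the raw corpus (file name and column names are assumed)
df = pd.read_csv("blog_authorship_corpus.csv")

# Keep only the 10 most prolific authors
top_authors = df["author"].value_counts().nlargest(10).index
subset = df[df["author"].isin(top_authors)].reset_index(drop=True)

# Map each author to an integer class label
label2id = {author: i for i, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author"].map(label2id)
```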
|
|
|
### Training |
|
|
|
- **Epochs:** 3 |
|
- **Batch size:** 8 |
|
- **Evaluation strategy:** Per epoch |
|
- **Learning rate:** 2e-5 |
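
A minimal Hugging Face `Trainer` setup matching these hyperparameters might look like the following; the output path and the `train_dataset`/`eval_dataset` objects are placeholders, not the repository's actual training script:

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters mirror the list above; the datasets are assumed to be
# pre-tokenized `datasets.Dataset` objects with a "label" column.
training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```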
|
|
|
### Quantization |
|
|
|
In addition to the float16 conversion listed under Model Details, post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate CPU inference:
|
|
|
```python
import torch

# Quantize the linear layers to INT8 for faster CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
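
Dynamic quantization swaps the floating-point `nn.Linear` layers for INT8 equivalents at inference time and is primarily a CPU optimization; on GPU, the float16 path shown in the usage example is the more common choice. The quantized model is used exactly like the original, e.g. `quantized_model(**inputs)`.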
|
|
|
## Repository Structure |
|
|
|
```
.
├── model/              # Fine-tuned and quantized model files
├── tokenizer_config/   # Tokenizer configuration and vocabulary
├── model.safetensors   # Safetensors version of the model weights
└── README.md           # Documentation
```
|
|
|
## Limitations |
|
|
|
- The model is limited to the top 10 authors used in fine-tuning. |
|
- May not generalize well to unseen authors or blogs outside the dataset distribution. |
|
- Quantization may slightly reduce prediction accuracy due to the lower numerical precision.
|
|
|
## Contributing |
|
|
|
Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request. |
|
|