# DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus
This repository hosts a fine-tuned DistilBERT model designed for the **authorship attribution** task on the Blog Authorship Corpus dataset. The model is optimized for identifying the author of a given blog post from a subset of top contributors.
## Model Details
- **Model Architecture:** DistilBERT Base (distilbert-base-uncased)
- **Task:** Authorship Attribution
- **Dataset:** Blog Authorship Corpus (Top 10 authors selected)
- **Quantization:** Float16 (Post-training)
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage
### Installation
```sh
pip install transformers torch
```
### Loading the Model
```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch
# Load fine-tuned model
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
# Set model to evaluation and convert to half precision
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
model.half()
# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"
# Tokenize input
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
inputs = {k: v.half() if v.dtype == torch.float else v for k, v in inputs.items()}
# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J",
}
predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
```
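To score several posts in one pass, the same pipeline can be batched. The snippet below is a minimal sketch that reuses the `model`, `tokenizer`, `device`, and `label_mapping` objects from the example above; the example posts themselves are placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder posts; replace with your own texts
blog_posts = [
    "Spent the whole weekend refactoring my side project and it finally builds cleanly.",
    "Tried my grandmother's lasagna recipe tonight and it turned out better than expected.",
]

# Tokenize as a single padded batch
batch = tokenizer(
    blog_posts, return_tensors="pt", padding=True, truncation=True, max_length=512
).to(device)

with torch.no_grad():
    logits = model(**batch).logits

# Convert logits to per-author probabilities and report the top prediction per post
probs = F.softmax(logits.float(), dim=-1)
for post, p in zip(blog_posts, probs):
    idx = int(p.argmax())
    print(f"{label_mapping[idx]} ({p[idx].item():.1%}): {post[:40]}...")
```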
## Performance Metrics
- **Accuracy:** ~78% on a held-out validation split of the top-10-author subset
- **Precision/Recall/F1:** vary per class; average F1 is roughly 0.75 (see the evaluation sketch below)
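The validation split itself is not bundled with this repository. If you hold out your own labelled split, per-class metrics of this kind can be computed with scikit-learn; the sketch below assumes hypothetical `val_texts` (a list of posts) and `val_labels` (integer author IDs) that you provide.

```python
import torch
from sklearn.metrics import accuracy_score, classification_report

# val_texts / val_labels are assumptions: your held-out posts and gold author IDs
predictions = []
for text in val_texts:
    enc = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=512
    ).to(device)
    with torch.no_grad():
        logits = model(**enc).logits
    predictions.append(int(logits.argmax(dim=-1)))

print("Accuracy:", accuracy_score(val_labels, predictions))
print(classification_report(val_labels, predictions))  # per-class precision/recall/F1
```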
## Fine-Tuning Details
### Dataset
The model was fine-tuned on a subset of the **Blog Authorship Corpus** restricted to the 10 most prolific authors. Each sample is a blog post paired with its author label.
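The author-selection step is not included in this repository. A minimal sketch of how such a subset could be built, assuming the corpus is available locally as a CSV with `text` and `author` columns (the file name and column names are assumptions):

```python
import pandas as pd

# Hypothetical local copy of the corpus
df = pd.read_csv("blog_authorship_corpus.csv")

# Keep only the 10 authors with the most posts
top_authors = df["author"].value_counts().nlargest(10).index
subset = df[df["author"].isin(top_authors)].copy()

# Map each author to an integer class label for classification
label2id = {author: i for i, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author"].map(label2id)
```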
### Training
- **Epochs:** 3
- **Batch size:** 8
- **Evaluation strategy:** Per epoch
- **Learning rate:** 2e-5
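A `Trainer` setup reflecting these hyperparameters might look like the following sketch; the tokenized `train_ds` / `val_ds` datasets and the output directory name are assumptions, not artifacts shipped in this repo.

```python
from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=10
)

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # hypothetical tokenized training split
    eval_dataset=val_ds,     # hypothetical tokenized validation split
)
trainer.train()
```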
### Quantization
Post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate inference:
```python
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
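Dynamic quantization converts the weights of the selected `torch.nn.Linear` layers to int8 and quantizes activations on the fly at inference time; PyTorch applies it on CPU backends, so it mainly speeds up CPU inference, while the float16 conversion shown in the usage example above is the more common choice on GPU.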
## Repository Structure
```
.
├── model/               # Contains the fine-tuned and quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary
├── model.safetensors    # Safetensors version of the model weights
└── README.md            # Documentation
```
## Limitations
- The model is limited to the top 10 authors used in fine-tuning.
- May not generalize well to unseen authors or blogs outside the dataset distribution.
- Quantization may slightly degrade prediction quality.
## Contributing
Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.