# DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus
This repository hosts a fine-tuned DistilBERT model designed for the **authorship attribution** task on the Blog Authorship Corpus dataset. The model is optimized for identifying the author of a given blog post from a subset of top contributors.
## Model Details
- **Model Architecture:** DistilBERT Base (distilbert-base-uncased)
- **Task:** Authorship Attribution
- **Dataset:** Blog Authorship Corpus (Top 10 authors selected)
- **Quantization:** Float16 (Post-training)
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage
### Installation
```sh
pip install transformers torch
```
### Loading the Model
```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch
# Load fine-tuned model
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
# Move the model to the target device and set evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Convert to half precision (float16); fp16 inference is only recommended on GPU
if device == "cuda":
    model.half()
# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"
# Tokenize input
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
# Cast any floating-point inputs to fp16 to match the model (integer input_ids/attention_mask are left unchanged)
inputs = {k: v.half() if v.dtype == torch.float else v for k, v in inputs.items()}
# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()
# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J"
}
predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
```
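To report a confidence score alongside the predicted label, apply a softmax over the logits. A minimal sketch that reuses the `outputs` and `label_mapping` objects from the snippet above:

```python
import torch.nn.functional as F

# Convert logits to per-author probabilities (cast to float32 in case the model runs in fp16)
probs = F.softmax(outputs.logits.float(), dim=1).squeeze(0)
confidence, predicted_class = torch.max(probs, dim=0)

print(f"Predicted Author: {label_mapping[predicted_class.item()]} "
      f"(confidence: {confidence.item():.2%})")
```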
## Performance Metrics
- **Accuracy:** ~78% (on validation set of top 10 authors)
- **Precision / Recall / F1:** vary per class; the average F1 score is around 0.75
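These numbers are not reproduced by a script in this repository; a minimal evaluation sketch using scikit-learn (assuming hypothetical `val_texts`/`val_labels` lists for a held-out split, plus the `model`, `tokenizer`, and `device` from the usage example above) would look like:

```python
from sklearn.metrics import accuracy_score, f1_score
import torch

def predict(texts, batch_size=16):
    """Return predicted class ids for a list of blog posts."""
    preds = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                        padding=True, truncation=True, max_length=512).to(device)
        with torch.no_grad():
            logits = model(**enc).logits
        preds.extend(logits.argmax(dim=1).cpu().tolist())
    return preds

val_preds = predict(val_texts)
print("Accuracy:", accuracy_score(val_labels, val_preds))
print("Macro F1:", f1_score(val_labels, val_preds, average="macro"))
```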
## Fine-Tuning Details
### Dataset
The model is trained on a subset of the **Blog Authorship Corpus** containing blogs from the top 10 most prolific authors. Each sample is a blog post with its corresponding author label.
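The preprocessing script is not included in this repository; a rough sketch of the author-selection step with pandas (assuming the corpus is available as a single CSV with an author-id column named `id` and a `text` column, which may differ in your copy) could look like:

```python
import pandas as pd

# Hypothetical path and column names; adjust to your copy of the corpus
df = pd.read_csv("blogtext.csv")

# Keep only posts written by the 10 most prolific authors
top_authors = df["id"].value_counts().nlargest(10).index
subset = df[df["id"].isin(top_authors)].copy()

# Map author ids to integer class labels 0-9 for classification
label2id = {author: idx for idx, author in enumerate(sorted(top_authors))}
subset["label"] = subset["id"].map(label2id)
```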
### Training
- **Epochs:** 3
- **Batch size:** 8
- **Evaluation strategy:** Per epoch
- **Learning rate:** 2e-5
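The original training script is not part of this repo, but the hyperparameters above correspond roughly to the following Hugging Face `Trainer` setup (a sketch only, assuming already-tokenized `train_dataset` and `eval_dataset` objects):

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```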
### Quantization
Post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate inference:
```python
# Quantize the Linear layers dynamically (weights stored as int8)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
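Dynamic quantization targets CPU inference. A quick way to sanity-check the size reduction is to serialize both state dicts to temporary files and compare them (a sketch, assuming `model` and `quantized_model` from the snippet above):

```python
import os
import tempfile
import torch

def serialized_size_mb(m):
    """Save a model's state dict to a temp file and return its size in MB."""
    fd, path = tempfile.mkstemp(suffix=".pt")
    os.close(fd)
    torch.save(m.state_dict(), path)
    size_mb = os.path.getsize(path) / (1024 * 1024)
    os.remove(path)
    return size_mb

print(f"Original model:  {serialized_size_mb(model):.1f} MB")
print(f"Quantized model: {serialized_size_mb(quantized_model):.1f} MB")
```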
## Repository Structure
```
.
├── model/              # Contains the fine-tuned and quantized model files
├── tokenizer_config/   # Tokenizer configuration and vocabulary
├── model.safetensors   # Safetensors version of the model weights
└── README.md           # Documentation
```
## Limitations
- The model is limited to the top 10 authors used in fine-tuning.
- May not generalize well to unseen authors or blogs outside the dataset distribution.
- Quantization may cause a slight drop in prediction accuracy.
## Contributing
Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.