# DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus
This repository hosts a fine-tuned DistilBERT model designed for the **authorship attribution** task on the Blog Authorship Corpus dataset. The model is optimized for identifying the author of a given blog post from a subset of top contributors.
## Model Details
- **Model Architecture:** DistilBERT Base (distilbert-base-uncased)
- **Task:** Authorship Attribution
- **Dataset:** Blog Authorship Corpus (Top 10 authors selected)
- **Quantization:** Float16 (Post-training)
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage
### Installation
```sh
pip install transformers torch
```
### Loading the Model
```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch
# Load fine-tuned model
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
# Set model to evaluation and convert to half precision
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
model.half()
# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"
# Tokenize input
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
inputs = {k: v.half() if v.dtype == torch.float else v for k, v in inputs.items()}
# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J",
}
predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
```
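To score several posts in one pass, the same pipeline can be batched. The snippet below is a minimal sketch that reuses the `model`, `tokenizer`, `device`, and `label_mapping` objects from the example above; the example posts themselves are placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder posts; replace with your own texts
blog_posts = [
    "Spent the whole weekend refactoring my side project and it finally builds cleanly.",
    "Tried my grandmother's lasagna recipe tonight and it turned out better than expected.",
]

# Tokenize as a single padded batch
batch = tokenizer(
    blog_posts, return_tensors="pt", padding=True, truncation=True, max_length=512
).to(device)

with torch.no_grad():
    logits = model(**batch).logits

# Convert logits to per-author probabilities and report the top prediction per post
probs = F.softmax(logits.float(), dim=-1)
for post, p in zip(blog_posts, probs):
    idx = int(p.argmax())
    print(f"{label_mapping[idx]} ({p[idx].item():.1%}): {post[:40]}...")
```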
## Performance Metrics
- **Accuracy:** ~78% on a held-out validation split of the top-10-author subset
- **Precision/Recall/F1:** vary per class; average F1 is roughly 0.75 (see the evaluation sketch below)
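The validation split itself is not bundled with this repository. If you hold out your own labelled split, per-class metrics of this kind can be computed with scikit-learn; the sketch below assumes hypothetical `val_texts` (a list of posts) and `val_labels` (integer author IDs) that you provide.

```python
import torch
from sklearn.metrics import accuracy_score, classification_report

# val_texts / val_labels are assumptions: your held-out posts and gold author IDs
predictions = []
for text in val_texts:
    enc = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=512
    ).to(device)
    with torch.no_grad():
        logits = model(**enc).logits
    predictions.append(int(logits.argmax(dim=-1)))

print("Accuracy:", accuracy_score(val_labels, predictions))
print(classification_report(val_labels, predictions))  # per-class precision/recall/F1
```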
## Fine-Tuning Details
### Dataset
The model was fine-tuned on a subset of the **Blog Authorship Corpus** restricted to the 10 most prolific authors. Each sample is a blog post paired with its author label.
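The author-selection step is not included in this repository. A minimal sketch of how such a subset could be built, assuming the corpus is available locally as a CSV with `text` and `author` columns (the file name and column names are assumptions):

```python
import pandas as pd

# Hypothetical local copy of the corpus
df = pd.read_csv("blog_authorship_corpus.csv")

# Keep only the 10 authors with the most posts
top_authors = df["author"].value_counts().nlargest(10).index
subset = df[df["author"].isin(top_authors)].copy()

# Map each author to an integer class label for classification
label2id = {author: i for i, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author"].map(label2id)
```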
### Training
- **Epochs:** 3
- **Batch size:** 8
- **Evaluation strategy:** Per epoch
- **Learning rate:** 2e-5
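A `Trainer` setup reflecting these hyperparameters might look like the following sketch; the tokenized `train_ds` / `val_ds` datasets and the output directory name are assumptions, not artifacts shipped in this repo.

```python
from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=10
)

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # hypothetical tokenized training split
    eval_dataset=val_ds,     # hypothetical tokenized validation split
)
trainer.train()
```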
### Quantization
Post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate inference:
```python
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
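Dynamic quantization converts the weights of the selected `torch.nn.Linear` layers to int8 and quantizes activations on the fly at inference time; PyTorch applies it on CPU backends, so it mainly speeds up CPU inference, while the float16 conversion shown in the usage example above is the more common choice on GPU.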
## Repository Structure
```
.
├── model/               # Contains the fine-tuned and quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary
├── model.safetensors    # Safetensors version of the model weights
└── README.md            # Documentation
```
## Limitations
- The model is limited to the top 10 authors used in fine-tuning.
- May not generalize well to unseen authors or blogs outside the dataset distribution.
- Quantization may slightly degrade prediction quality.
## Contributing
Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.