# DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus

This repository hosts a fine-tuned DistilBERT model for the **authorship attribution** task on the Blog Authorship Corpus. Given a blog post, the model predicts its author from a fixed set of top contributors.

## Model Details

- **Model Architecture:** DistilBERT Base (distilbert-base-uncased)
- **Task:** Authorship Attribution
- **Dataset:** Blog Authorship Corpus (top 10 authors selected)
- **Quantization:** Float16 (post-training)
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch

# Load the fine-tuned model and tokenizer
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Set the model to evaluation mode and convert it to half precision
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
model.half()  # float16 inference; best suited to GPU execution

# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"

# Tokenize the input
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
# Cast any floating-point inputs to half precision
# (token IDs and attention masks are integer tensors, so this is a no-op here)
inputs = {k: v.half() if v.dtype == torch.float else v for k, v in inputs.items()}

# Make a prediction
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A", 1: "Author_B", 2: "Author_C", 3: "Author_D", 4: "Author_E",
    5: "Author_F", 6: "Author_G", 7: "Author_H", 8: "Author_I", 9: "Author_J"
}

predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
```

## Performance Metrics

- **Accuracy:** ~78% on the validation set (top 10 authors)
- **Precision / Recall / F1:** vary per class; average F1 of roughly 0.75

## Fine-Tuning Details

### Dataset

The model was fine-tuned on a subset of the **Blog Authorship Corpus** containing blogs from the 10 most prolific authors. Each sample is a blog post paired with its author label.

### Training

- **Epochs:** 3
- **Batch size:** 8
- **Evaluation strategy:** per epoch
- **Learning rate:** 2e-5

A minimal reproduction sketch using these hyperparameters is given in the appendix at the end of this README.

### Quantization

Post-training dynamic quantization with PyTorch was applied to reduce model size and accelerate inference:

```python
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

A sketch of saving the quantized weights and comparing file sizes is also included in the appendix.

## Repository Structure

```
.
├── model/               # Fine-tuned and quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary
├── model.safetensors    # Safetensors version of the model weights
└── README.md            # Documentation
```

## Limitations

- The model can only attribute posts to the 10 authors it was fine-tuned on.
- It may not generalize well to unseen authors or to blogs outside the dataset distribution.
- Quantization may cause a small drop in prediction accuracy.

## Contributing

Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.
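
## Appendix: Fine-Tuning Sketch

The following is a minimal sketch of how the setup described in the *Training* section could be reproduced with the Hugging Face `Trainer`; it is not the exact script used to produce this model. The CSV path `blog_corpus.csv`, the `text`/`author` column names, and the 90/10 train/validation split are illustrative assumptions. Only the hyperparameters (3 epochs, batch size 8, learning rate 2e-5, evaluation per epoch) come from this README.

```python
import pandas as pd
from datasets import Dataset
from transformers import (
    DataCollatorWithPadding,
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Assumption: the corpus has been exported to a CSV with "text" and "author" columns.
df = pd.read_csv("blog_corpus.csv")

# Keep only the 10 most prolific authors and map each to an integer label.
top_authors = df["author"].value_counts().nlargest(10).index
df = df[df["author"].isin(top_authors)].copy()
label2id = {author: i for i, author in enumerate(sorted(top_authors))}
df["label"] = df["author"].map(label2id)

# Build a train/validation split (90/10 here is an illustrative choice).
dataset = Dataset.from_pandas(df[["text", "label"]], preserve_index=False)
dataset = dataset.train_test_split(test_size=0.1, seed=42)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=10,
    id2label={i: a for a, i in label2id.items()},
    label2id=label2id,
)

# Hyperparameters from the "Training" section above.
# Note: older transformers releases call eval_strategy "evaluation_strategy".
training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

trainer.train()
trainer.save_model("fine-tuned-model")
tokenizer.save_pretrained("fine-tuned-model")
```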
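
### Saving and Sizing the Quantized Model (Sketch)

The snippet in the *Quantization* section produces `quantized_model` in memory but does not show persistence. The sketch below is an illustrative assumption rather than the repository's actual workflow: it saves the quantized state dict with `torch.save` and compares its on-disk size against the float32 checkpoint. The file names are arbitrary.

```python
import os
import torch
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("fine-tuned-model")
model.eval()

# Dynamic INT8 quantization of the linear layers, as in the section above.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save both state dicts with torch.save so their file sizes can be compared.
torch.save(model.state_dict(), "float32_model.pt")
torch.save(quantized_model.state_dict(), "quantized_model.pt")

for path in ("float32_model.pt", "quantized_model.pt"):
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")
```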