---
library_name: transformers
tags:
- dante
- literature
- italian
license: cc-by-sa-4.0
datasets:
- maiurilorenzo/divina-commedia
language:
- it
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# Model Card for DanteGPT

This model, **DanteGPT**, is a fine-tuned version of GPT-2 designed to generate text in the style of Dante Alighieri’s *Divina Commedia*. The model emulates Dante's poetic structure, including his interlocking tercets in terza rima (ABA BCB CDC), and the thematic elements of his work, such as divine justice and moral reflection.

## Model Details

### Model Description

- **Developed by:** Lorenzo Maiuri
- **Funded by:** Independent research
- **Shared by:** Lorenzo Maiuri
- **Model type:** Fine-tuned GPT-2
- **Language(s) (NLP):** Italian (`it`)
- **License:** CC BY-SA 4.0
- **Finetuned from model:** GPT-2 (base version by OpenAI)

### Model Sources

- **Repository:** [Hugging Face Model Repository](https://huggingface.co/maiurilorenzo/dante-gpt)
- **Dataset:** [Divina Commedia](https://huggingface.co/datasets/maiurilorenzo/divina-commedia)
- **Kaggle Notebook:** [Link to Kaggle Notebook](https://www.kaggle.com/code/lorenzomaiuri/dante-gpt)
- **Demo:** [DanteGPT Space](https://huggingface.co/spaces/maiurilorenzo/dante-gpt-space)

## Uses

### Try It Out

You can try this model interactively in the [DanteGPT Space](https://huggingface.co/spaces/maiurilorenzo/dante-gpt-space). Simply enter a text prompt, and the model will generate verses in the style of Dante Alighieri!

### Direct Use

The model is designed to generate text in the style of the *Divina Commedia* and can be used for literary exploration, creative writing, and educational purposes.

### Downstream Use

Users may adapt the model through additional fine-tuning on similar literary texts or use it to generate other forms of poetic or stylistic writing.

### Out-of-Scope Use

The model may produce inaccurate or nonsensical text when used outside its intended domain. It is not suitable for tasks requiring factual accuracy or ethical decision-making.

## Bias, Risks, and Limitations

### Biases

- The model reflects the content and biases of the original dataset, which is a historical text. Modern ethical, cultural, and social considerations may not align with the themes or language of Dante's work.

### Risks

- The model may inadvertently generate offensive or inappropriate content when prompted with ambiguous or unrelated topics.
- Over-reliance on this model for literary generation without proper human oversight may lead to misrepresentation of Dante’s work.

### Recommendations

Users should validate generated content for coherence and appropriateness. It is recommended to use the model in combination with literary expertise to ensure quality.
## How to Get Started with the Model

To use the model for text generation, run the following code snippet:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("maiurilorenzo/dante-gpt")
model = GPT2LMHeadModel.from_pretrained("maiurilorenzo/dante-gpt")

# Generate text from a prompt (the opening line of the Inferno)
prompt = "Nel mezzo del cammin di nostra vita,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=2)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Details

### Training Data

The model was fine-tuned on the Divina Commedia dataset available on the Hugging Face Hub (`maiurilorenzo/divina-commedia`). The dataset contains cleaned and tokenized text from the original work.

### Training Procedure

#### Preprocessing

- Removed entries exceeding 1024 tokens to stay within GPT-2's input limit.
- Split the dataset into training and test subsets.
- Wrapped each entry with the special tokens `<|startoftext|>` and `<|endoftext|>` for model training.

#### Training Hyperparameters

- **Training regime:** FP16 mixed precision
- **Learning rate:** 2e-5
- **Batch size:** 16 (with gradient accumulation to simulate larger batch sizes)
- **Epochs:** 5
- **Optimizer:** AdamW
- **Scheduler:** Linear warm-up with decay

A hedged code sketch of this configuration is provided in the appendix at the end of this card.

#### Speeds, Sizes, Times

- **Training time:** ~1.5 hours on an NVIDIA Tesla P100 (16 GB)
- **Model size:** ~500 MB

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

A subset of 20 samples from the dataset was held out for testing.

#### Factors

Evaluation focused on:

- Coherence of the generated text.
- Thematic relevance to the *Divina Commedia*.

#### Metrics

- **Human evaluation:** Subjective assessment of the generated text's quality.

### Results

- Human evaluation: 75% accuracy in replicating Dante’s style (based on thematic and stylistic criteria).

#### Summary

The model generates stylistically accurate text that aligns with the poetic form and thematic elements of Dante’s work, although inconsistencies in rhyme and coherence may occur in longer outputs.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA Tesla P100 (16 GB)
- **Hours used:** ~1.5
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** ~0.21 kg CO2eq

## Technical Specifications

### Model Architecture and Objective

- **Base model:** GPT-2
- **Objective:** Minimize cross-entropy loss between predicted and target tokens on the fine-tuning data.

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA Tesla P100 (16 GB)
- **RAM:** 32 GB

#### Software

- Hugging Face Transformers
- PyTorch

## Citation

**BibTeX:**

```
@misc{maiurilorenzo/dante-gpt,
  author = {Lorenzo Maiuri},
  title = {DanteGPT: Generating Text in the Style of Dante Alighieri},
  year = {2024},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/maiurilorenzo/dante-gpt}
}
```

**APA:**

Maiuri, L. (2024). *DanteGPT: Generating Text in the Style of Dante Alighieri*. Hugging Face Hub. https://huggingface.co/maiurilorenzo/dante-gpt
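## Appendix: Fine-Tuning Sketch

For readers who want to reproduce a comparable setup, the snippet below sketches one way to wire up the preprocessing steps and hyperparameters listed under Training Details using the Hugging Face `datasets` and `Trainer` APIs. It is a minimal, hedged illustration rather than the actual training script (see the linked Kaggle notebook for that): the `train` split name, the `text` column, the 90/10 train/test ratio, the per-device batch size / gradient-accumulation split, and the warm-up length are all assumptions.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

# Base GPT-2 model and tokenizer; register the wrapper tokens used during preprocessing.
tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
tokenizer.add_special_tokens({
    "bos_token": "<|startoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "<|endoftext|>",
})
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the added <|startoftext|> token

# Load the Divina Commedia dataset and create train/test subsets.
# The "train" split name, "text" column, and 90/10 ratio are assumptions.
dataset = load_dataset("maiurilorenzo/divina-commedia", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    # Wrap each entry with the special tokens; the card removes entries longer than
    # 1024 tokens, and truncation is used here as a simpler stand-in.
    texts = [f"<|startoftext|>{t}<|endoftext|>" for t in batch["text"]]
    return tokenizer(texts, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# Hyperparameters from the card: FP16 mixed precision, learning rate 2e-5, effective
# batch size 16 via gradient accumulation, 5 epochs, AdamW (the Trainer default) with
# a linear schedule after warm-up. The 4 x 4 accumulation split and the warm-up length
# are assumptions; fp16=True assumes a CUDA GPU.
training_args = TrainingArguments(
    output_dir="dante-gpt",
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    lr_scheduler_type="linear",
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Report the loss on the held-out test subset.
print(trainer.evaluate(eval_dataset=tokenized["test"]))
```

As configured, the AdamW optimizer and the linear warm-up/decay schedule come from the `Trainer` defaults, which match the hyperparameters listed in the card.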