---
library_name: transformers
tags:
- lyrics
- text
- text-to-lyrics
- artist-to-lyrics
- text-generation
datasets:
- smgriffin/modern-pop-lyrics
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# Model Card for pop-lyrics-generator-v1

<!-- Provide a quick summary of what the model is/does. -->
Fine-tuned from openai-community/gpt2 on smgriffin/modern-pop-lyrics; generates lyrics in the style of specific pop artists.


### Model Description

<!-- Provide a longer summary of what this model is. -->

It's pretty good at generating a song structure and stylized lyrics by artist, but bad at rhyming. It sometimes repeats the same thing over and over, but so do pop artists.
It might be good for inspiration while writing lyrics. Some of the content generated can be really silly and potentially offensive - especially if you input Lil Wayne.

- **Developed by:** Scott Griffin
- **Model type:** Generative Language
- **Language(s) (NLP):** English, Spanish
- **Finetuned from model:** openai-community/gpt2

Check out the w&b run here: [https://wandb.ai/scottgriffinm-scott-griffin-industrial-complex/pop-lyrics-generator-v1?nw=nwuserscottgriffinm](https://wandb.ai/scottgriffinm-scott-griffin-industrial-complex/pop-lyrics-generator-v1?nw=nwuserscottgriffinm)

& my blog post on making it here: [https://scottsblog.glitch.me#pop-lyrics-generator-v1](https://scottsblog.glitch.me#pop-lyrics-generator-v1)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This model is not for commercial use. The lyrics used for fine-tuning are the property of the individual artists they were taken from.
This is for research purposes only.

## How to Use

Use the code below to generate lyrics:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# load model
model_name = "smgriffin/pop-lyrics-generator-v1"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# create text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# prompt for justin bieber lyrics
artist_name = "Justin Bieber"
prompt = f"Artist: {artist_name}\nLyrics:"

# generate and print
generated_texts = text_generator(
    prompt, 
    max_length=150,
    num_return_sequences=1,  
    temperature=0.9,  # less than .9 results in a lot of repeated lyrics
    top_k=50,
    top_p=0.95,
    do_sample=True, 
)

print("Generated Lyrics:")
print(generated_texts[0]["generated_text"])
```
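
As noted above, output can get repetitive, especially at lower temperatures. If that becomes a problem, one option (not part of the original example; just a sketch using standard `transformers` generation parameters with untuned values) is to add a repetition penalty and block repeated n-grams:

```python
# same pipeline call as above, with anti-repetition settings added;
# repetition_penalty and no_repeat_ngram_size are standard generation
# parameters, and the values below are untuned starting points
generated_texts = text_generator(
    prompt,
    max_length=150,
    num_return_sequences=1,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.2,   # >1.0 discourages repeating earlier tokens
    no_repeat_ngram_size=3,   # block exact 3-gram repeats
)

print(generated_texts[0]["generated_text"])
```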


## How to Fine-Tune Your Own Lyric Generation Model

Use the code below to fine-tune your own GPT-2 model (for example, on the smgriffin/modern-pop-lyrics dataset):

```python
import os
import pandas as pd
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling


# output directory
output_dir = "/your/output/directory"
os.makedirs(output_dir, exist_ok=True)

# load dataset
dataset = load_dataset("smgriffin/modern-pop-lyrics")

# preprocess dataset
def preprocess_function(example):
    # Combine artist name with lyrics for conditioning
    combined = [f"Artist: {artist}\nLyrics: {lyrics}\n\n" for artist, lyrics in zip(example['artist'], example['lyrics'])]
    return {"text": combined}

processed_dataset = dataset.map(preprocess_function, batched=True)

# split to train and test sets
train_test_split = processed_dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
val_dataset = train_test_split["test"]

# load tokenizer, model
model_name = "gpt2"  # Base GPT-2 model for fine-tuning
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# set pad_token to eos_token (GPT-2 doesn't have a padding token)
tokenizer.pad_token = tokenizer.eos_token

# tokenize dataset
def tokenize_function(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    return {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
        "labels": tokenized["input_ids"], 
    }

train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])
val_dataset = val_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])

# data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# load GPT-2
model = GPT2LMHeadModel.from_pretrained(model_name)

# training arguments
training_args = TrainingArguments(
    output_dir=output_dir, 
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8, 
    per_device_eval_batch_size=8,
    num_train_epochs=10,  
    save_steps=1000,
    save_total_limit=1,  
    logging_dir=f"{output_dir}/logs", 
    logging_steps=50,
    gradient_accumulation_steps=2,  
    fp16=True, 
    max_grad_norm=1.0,
    push_to_hub=False,
)

# init trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# start fine-tuning
trainer.train()

# save model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

```
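
After training, the checkpoint saved in `output_dir` can be loaded the same way as the hosted model in the usage example above. A minimal sketch, assuming the `output_dir` path from the script:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# load the fine-tuned model and tokenizer from the local output directory
output_dir = "/your/output/directory"
model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# prompt format matches the fine-tuning preprocessing ("Artist: ...\nLyrics:")
prompt = "Artist: Justin Bieber\nLyrics:"
print(generator(prompt, max_length=150, do_sample=True, temperature=0.9)[0]["generated_text"])
```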