---
library_name: transformers
tags:
- lyrics
- text
- text-to-lyrics
- artist-to-lyrics
- text-generation
datasets:
- smgriffin/modern-pop-lyrics
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# Model Card for pop-lyrics-generator-v1

<!-- Provide a quick summary of what the model is/does. -->
Fine-tuned from openai-community/gpt2 on smgriffin/modern-pop-lyrics; generates lyrics in the style of specific pop artists.


### Model Description

<!-- Provide a longer summary of what this model is. -->

It's pretty good at generating a song structure and stylized lyrics by artist, but bad at rhyming. It sometimes repeats the same thing over and over, but so do pop artists.
It might be good for inspiration while writing lyrics. Some of the content generated can be really silly and potentially offensive - especially if you input Lil Wayne.

- **Developed by:** Scott Griffin
- **Model type:** Generative Language
- **Language(s) (NLP):** English, Spanish
- **Finetuned from model:** openai-community/gpt2

Check out the w&b run here: [https://wandb.ai/scottgriffinm-scott-griffin-industrial-complex/pop-lyrics-generator-v1?nw=nwuserscottgriffinm](https://wandb.ai/scottgriffinm-scott-griffin-industrial-complex/pop-lyrics-generator-v1?nw=nwuserscottgriffinm)

& my blog post on making it here: [https://scottsblog.glitch.me#pop-lyrics-generator-v1](https://scottsblog.glitch.me#pop-lyrics-generator-v1)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This model is not for commercial use. The lyrics used for fine-tuning are the property of the individual artists they were taken from.
This is for research purposes only.

## How to Use

Use the code below to generate lyrics:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# load model
model_name = "smgriffin/pop-lyrics-generator-v1"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# create text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# prompt for justin bieber lyrics
artist_name = "Justin Bieber"
prompt = f"Artist: {artist_name}\nLyrics:"

# generate and print
generated_texts = text_generator(
    prompt, 
    max_length=150,
    num_return_sequences=1,  
    temperature=0.9,  # less than .9 results in a lot of repeated lyrics
    top_k=50,
    top_p=0.95,
    do_sample=True, 
)

print("Generated Lyrics:")
print(generated_texts[0]["generated_text"])
```
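
As noted above, output can get repetitive, especially at lower temperatures. If that becomes a problem, one option (not part of the original example; just a sketch using standard `transformers` generation parameters with untuned values) is to add a repetition penalty and block repeated n-grams:

```python
# same pipeline call as above, with anti-repetition settings added;
# repetition_penalty and no_repeat_ngram_size are standard generation
# parameters, and the values below are untuned starting points
generated_texts = text_generator(
    prompt,
    max_length=150,
    num_return_sequences=1,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.2,   # >1.0 discourages repeating earlier tokens
    no_repeat_ngram_size=3,   # block exact 3-gram repeats
)

print(generated_texts[0]["generated_text"])
```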


## How to Fine-Tune Your Own Lyric Generation Model

Use the code below to fine-tune your own GPT-2 model (for example, on the smgriffin/modern-pop-lyrics dataset):

```python
import os
import pandas as pd
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling


# output directory
output_dir = "/your/output/directory"
os.makedirs(output_dir, exist_ok=True)

# load dataset
dataset = load_dataset("smgriffin/modern-pop-lyrics")

# preprocess dataset
def preprocess_function(example):
    # Combine artist name with lyrics for conditioning
    combined = [f"Artist: {artist}\nLyrics: {lyrics}\n\n" for artist, lyrics in zip(example['artist'], example['lyrics'])]
    return {"text": combined}

processed_dataset = dataset.map(preprocess_function, batched=True)

# split to train and test sets
train_test_split = processed_dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
val_dataset = train_test_split["test"]

# load tokenizer, model
model_name = "gpt2"  # Base GPT-2 model for fine-tuning
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# set pad_token to eos_token (GPT-2 doesn't have a padding token)
tokenizer.pad_token = tokenizer.eos_token

# tokenize dataset
def tokenize_function(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    return {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
        "labels": tokenized["input_ids"], 
    }

train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])
val_dataset = val_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])

# data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# load GPT-2
model = GPT2LMHeadModel.from_pretrained(model_name)

# training arguments
training_args = TrainingArguments(
    output_dir=output_dir, 
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8, 
    per_device_eval_batch_size=8,
    num_train_epochs=10,  
    save_steps=1000,
    save_total_limit=1,  
    logging_dir=f"{output_dir}/logs", 
    logging_steps=50,
    gradient_accumulation_steps=2,  
    fp16=True, 
    max_grad_norm=1.0,
    push_to_hub=False,
)

# init trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# start fine-tuning
trainer.train()

# save model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

```
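
After training, the checkpoint saved in `output_dir` can be loaded the same way as the hosted model in the usage example above. A minimal sketch, assuming the `output_dir` path from the script:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# load the fine-tuned model and tokenizer from the local output directory
output_dir = "/your/output/directory"
model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# prompt format matches the fine-tuning preprocessing ("Artist: ...\nLyrics:")
prompt = "Artist: Justin Bieber\nLyrics:"
print(generator(prompt, max_length=150, do_sample=True, temperature=0.9)[0]["generated_text"])
```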