---
base_model:
- unsloth/Llama-3.2-1B-Instruct
library_name: transformers
language:
- en
license: cc0-1.0
tags:
- unsloth
---
|
# !!!!! Disclaimer: for now, the experimentation has not led me anywhere due to the limited resources I have, and I do not recommend downloading this model. Still working on it.
|
|
|
PEFT Finnegan-tuned LLaMA 3.2-1B-Instruct on part of a Finnegans Wake dataset, for text generation in the style of James Joyce.
|
|
|
Space: https://huggingface.co/spaces/genaforvena/huivam_finnegans_spaceship
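
For the curious, a minimal inference sketch. The model id is a placeholder for this repo's id, and the prompt is an illustrative example mirroring the "### GIVEN THE CONTEXT / ### INSTRUCTION" format used for fine-tuning below:
```
# Minimal inference sketch; "<this-repo-id>" is a placeholder, and the
# prompt follows the training format described later in this card.
from transformers import pipeline

generator = pipeline("text-generation", model="<this-repo-id>")
messages = [{
    "role": "user",
    "content": "### GIVEN THE CONTEXT: riverrun, past Eve and Adam's "
               "### INSTRUCTION: from swerve of shore to bend of bay",
}]
out = generator(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the assistant reply
```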
|
|
|
## Iteration 3:

Realized I was doing it all wrong, so this time I used https://huggingface.co/unsloth/Llama-3.2-1B-Instruct and the Colab notebook available from there, changing only the dataset.

My Colab is here: https://colab.research.google.com/drive/1JrqcU9idXXR3Wru5mw2e6Uh2TKJWwu7U?usp=sharing

The only difference: I created the dataset as below.
|
```
import json
import random

from transformers import AutoTokenizer
from unsloth.chat_templates import get_chat_template  # For chat template formatting
from datasets import Dataset, load_dataset

# Configuration
INPUT_FILE = "finnegans_30.txt"  # Path to your Finnegans Wake text file
OUTPUT_FILE = "finnegans_wake_dataset.jsonl"  # Local file to save the dataset
CHUNK_SIZE = 24  # Tokens per chunk

# Load the base tokenizer (the original snippet used `tokenizer` before defining it)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")

# Apply the chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",  # Use the LLaMA-3.1 chat template
)

# Load the text
with open(INPUT_FILE, "r", encoding="utf-8") as file:
    text = file.read()

# Tokenize the text
tokens = tokenizer.encode(text, truncation=False, add_special_tokens=False)

# Split tokens into chunks
chunks = [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), CHUNK_SIZE)]

# Prepare dataset in conversational format
dataset = []
for chunk in chunks:
    chunk_text = tokenizer.decode(chunk, skip_special_tokens=True)
    if len(chunk_text) < 2:  # random.sample below needs at least two positions
        continue

    # Split the chunk into three parts at two random split points
    split_points = sorted(random.sample(range(len(chunk_text)), 2))
    context = chunk_text[:split_points[0]]
    instruction = chunk_text[split_points[0]:split_points[1]]
    response = chunk_text[split_points[1]:]

    # Format as a conversation
    conversation = [
        {"role": "user", "content": f"### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction}"},
        {"role": "assistant", "content": response},
    ]
    dataset.append({"conversations": conversation})

# Save dataset locally as a .jsonl file
with open(OUTPUT_FILE, "w", encoding="utf-8") as file:
    for item in dataset:
        json.dump(item, file)
        file.write("\n")

print(f"Dataset saved locally to {OUTPUT_FILE}")

# Render each conversation into a single "text" field with the chat template
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

dataset = Dataset.from_dict({"conversations": [d["conversations"] for d in dataset]})
formatted_dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["conversations"])

# Save the formatted dataset
formatted_dataset.to_json("formatted_finnegans_wake_dataset.jsonl")
print("Formatted dataset saved to formatted_finnegans_wake_dataset.jsonl")

# Load the formatted dataset back with load_dataset
dataset = load_dataset("json", data_files="formatted_finnegans_wake_dataset.jsonl", split="train")
```
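
Everything else followed the linked Colab. For orientation, a minimal sketch of the training step it runs on this dataset, assuming the usual unsloth + trl setup; the hyperparameters here are illustrative placeholders, not the exact Colab values:
```
# Training sketch under the assumptions above; tune r, batch size, epochs,
# and learning rate to taste.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,      # the formatted dataset loaded above
    dataset_text_field="text",  # train on the chat-templated text
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```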
|
|
|
## Iteration 2 (Fail):

Dataset: same as before (forgot to save the config with the new dataset).
|
|
|
finnetune.yaml:
```
# The ID of the dataset you created
dataset: huivam-finnegans-2

# Configuration for text completion fine-tuning
text_completion:
  # How the fields of the JSON dataset should be formatted into the input text
  input_template: "### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction} ### RESPONSE IS: "

  # How the fields of the JSON dataset should be formatted into the output text
  output_template: "ANSWER: {response}"

# The Fireworks model name of the base model
base_model: accounts/fireworks/models/llama-v3p2-1b-instruct
```
|
|
|
Finne-tuning commands used:
```
./firectl create dataset huivam-finnegans-2 .\finnegans_wake_dataset_2.jsonl
./firectl create fine-tuning-job --settings-file finnetune.yaml --epochs=3 --learning-rate=2e-5 --batch-size=8
```
|
|
|
New params used to finne-tune:
```
Text Completion:
Input Template: ### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction} ### RESPONSE IS:
Output Template: ANSWER: {response}
Base Model: accounts/fireworks/models/llama-v3p2-1b-instruct
Epochs: 3
Learning Rate: 2e-05
LoRA Rank: 8
Batch Size: 8
Evaluation Split: 0
```
|
|
|
Spent: $0.08

Time: 5 mins
|
|
|
## Iteration 1:
|
|
|
I prepared the dataset like this:
|
```
import json
import random

from transformers import AutoTokenizer

# Configuration (implicit in the original snippet; MODEL_NAME is an assumption
# matching the Fireworks base model used below)
MODEL_NAME = "meta-llama/Llama-3.2-1B"
INPUT_FILE = "finnegans_30.txt"  # Path to the Finnegans Wake text file
OUTPUT_FILE = "finnegans_wake_dataset.jsonl"  # Local file to save the dataset
CHUNK_SIZE = 24  # Tokens per chunk

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load the text
with open(INPUT_FILE, "r", encoding="utf-8") as file:
    text = file.read()

# Tokenize the text
tokens = tokenizer.encode(text, truncation=False, add_special_tokens=False)

# Split tokens into chunks
chunks = [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), CHUNK_SIZE)]

# Prepare dataset
dataset = []
for chunk in chunks:
    chunk_text = tokenizer.decode(chunk, skip_special_tokens=True)
    if len(chunk_text) < 2:  # random.sample below needs at least two positions
        continue

    # Split the chunk into three parts at two random split points
    split_points = sorted(random.sample(range(len(chunk_text)), 2))
    context = chunk_text[:split_points[0]]
    instruction = chunk_text[split_points[0]:split_points[1]]
    response = chunk_text[split_points[1]:]

    # Add to dataset
    dataset.append({
        "context": context,
        "instruction": instruction,
        "response": response,
    })

# Save dataset locally as a .jsonl file
with open(OUTPUT_FILE, "w", encoding="utf-8") as file:
    for item in dataset:
        json.dump(item, file)
        file.write("\n")

print(f"Dataset saved locally to {OUTPUT_FILE}")
```
|
|
|
Example of a dataset entry:
```
{"context": "riverrun, past Eve and Adam's, from swerve of shore to bend of bay...", "instruction": "Sir Tristram, violer d'amores, fr'over the short sea...", "response": "O here here how hoth sprowled met the duskt the father of fornicationists..."}
```
|
|
|
Fine-tuned on 1/10th of the text on fireworks.ai with these params:
```
dataset: finnegans_wake_dataset

text_completion:
  # How the fields of the JSON dataset should be formatted into the input text
  input_template: "### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction} ### RESPONSE IS: "

  # How the fields of the JSON dataset should be formatted into the output text
  output_template: "ANSWER: {response}"

# The Fireworks model name of the base model
base_model: accounts/fireworks/models/llama-v3p2-1b

# Hyperparameters for fine-tuning (should be passed as args and removed from here)
hyperparameters:
  learning_rate: 1e-5  # Learning rate for the optimizer
  epochs: 1            # Number of epochs to train
  batch_size: 4        # Batch size for training
```
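
For reference, applying these templates to the example entry above yields a single training example roughly like:
```
### GIVEN THE CONTEXT: riverrun, past Eve and Adam's, from swerve of shore to bend of bay... ### INSTRUCTION: Sir Tristram, violer d'amores, fr'over the short sea... ### RESPONSE IS: ANSWER: O here here how hoth sprowled met the duskt the father of fornicationists...
```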
|
|
|
Spent: $0.01

Time: 2 mins

Result: seemingly not enough data to affect model output (with CHUNK_SIZE = 24, each example is only a couple dozen tokens, and a tenth of the text makes for a very small corpus).