genaforvena committed on
Commit 2546e4a · verified · 1 Parent(s): 0c6100a

Update README.md

Files changed (1):
  1. README.md +87 -2
README.md CHANGED
@@ -1,10 +1,12 @@
  ---
  base_model:
  - unsloth/Llama-3.2-1B-Instruct
- library_name: peft
  language:
  - en
  license: cc0-1.0
  ---
  # A !!!!!disclaimer uh. for now, the experimentation does not lead me anywhere due to the limited resources that I have, and I do not recommend downloading this model. Working on working on it.

@@ -12,7 +14,90 @@ PEFT Finnegan-tuned LLaMA 3.2-1B-instruct on part of Finnegans Wake dataset for
  Space: https://huggingface.co/spaces/genaforvena/huivam_finnegans_spaceship

- ## Iteration 2:
  Dataset: same (forgot to save config with new dataset).

---
base_model:
- unsloth/Llama-3.2-1B-Instruct
library_name: transformers
language:
- en
license: cc0-1.0
tags:
- unsloth
---
# A !!!!!disclaimer uh. for now, the experimentation does not lead me anywhere due to the limited resources that I have, and I do not recommend downloading this model. Working on working on it.
 
Space: https://huggingface.co/spaces/genaforvena/huivam_finnegans_spaceship

## Iteration 3:
Realized that I was doing it all wrong, and this time used https://huggingface.co/unsloth/Llama-3.2-1B-Instruct and the Colab notebook available from there. Only changed the dataset.

My Colab is here: https://colab.research.google.com/drive/1JrqcU9idXXR3Wru5mw2e6Uh2TKJWwu7U?usp=sharing

The only difference: created the dataset as below.
23
+ ```
24
+ from unsloth.chat_templates import get_chat_template
25
+ import json
26
+ import random
27
+ from transformers import AutoTokenizer
28
+ from unsloth.chat_templates import get_chat_template # For chat template formatting
29
+ from datasets import Dataset, load_dataset
30
+
31
+ # Configuration
32
+ INPUT_FILE = "finnegans_30.txt" # Path to your Finnegans Wake text file
33
+ OUTPUT_FILE = "finnegans_wake_dataset.jsonl" # Local file to save the dataset
34
+ CHUNK_SIZE = 24
35
+
36
+ # Apply the chat template
37
+ tokenizer = get_chat_template(
38
+ tokenizer,
39
+ chat_template="llama-3.1", # Use the LLaMA-3.1 chat template
40
+ )
41
+
42
+ # Load the text
43
+ with open(INPUT_FILE, "r", encoding="utf-8") as file:
44
+ text = file.read()
45
+
46
+ # Tokenize the text
47
+ tokens = tokenizer.encode(text, truncation=False, add_special_tokens=False)
48
+
49
+ # Split tokens into chunks
50
+ chunks = [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), CHUNK_SIZE)]
51
+
52
+ # Prepare dataset in conversational format
53
+ dataset = []
54
+ for chunk in chunks:
55
+ chunk_text = tokenizer.decode(chunk, skip_special_tokens=True)
56
+
57
+ # Split the chunk into three parts randomly
58
+ split_points = sorted(random.sample(range(len(chunk_text)), 2)) # Two random split points
59
+ context = chunk_text[:split_points[0]]
60
+ instruction = chunk_text[split_points[0]:split_points[1]]
61
+ response = chunk_text[split_points[1]:]
62
+
63
+ # Format as a conversation
64
+ conversation = [
65
+ {"role": "user", "content": f"### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction}"},
66
+ {"role": "assistant", "content": response},
67
+ ]
68
+
69
+ # Add to dataset
70
+ dataset.append({"conversations": conversation})
71
+
72
+ # Save dataset locally as a .jsonl file
73
+ with open(OUTPUT_FILE, "w", encoding="utf-8") as file:
74
+ for item in dataset:
75
+ json.dump(item, file)
76
+ file.write("\n")
77
+
78
+ print(f"Dataset saved locally to {OUTPUT_FILE}")
79
+
80
+ # Apply the formatting function
81
+ def formatting_prompts_func(examples):
82
+ convos = examples["conversations"]
83
+ texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
84
+ return {"text": texts}
85
+
86
+ # Apply the formatting function using Dataset.from_dict
87
+ dataset = Dataset.from_dict({"conversations": [d['conversations'] for d in dataset]})
88
+
89
+ formatted_dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=['conversations'])
90
+
91
+ # Save the formatted dataset
92
+ formatted_dataset.to_json("formatted_finnegans_wake_dataset.jsonl")
93
+ print("Formatted dataset saved to formatted_finnegans_wake_dataset.jsonl")
94
+
95
+ # Load the formatted dataset using load_dataset
96
+ dataset = load_dataset("json", data_files="formatted_finnegans_wake_dataset.jsonl", split="train")
97
+ dataset = dataset
98
+ ```
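The context/instruction/response split above hinges on picking two distinct random cut points inside each decoded chunk. A minimal, self-contained sketch of just that logic (the `split_chunk` helper and the seeded `random.Random` are illustrative additions, not part of the notebook):

```python
import random

def split_chunk(chunk_text, rng):
    # Two distinct random cut points, sorted so the slices line up;
    # assumes chunk_text has at least two characters
    split_points = sorted(rng.sample(range(len(chunk_text)), 2))
    context = chunk_text[:split_points[0]]
    instruction = chunk_text[split_points[0]:split_points[1]]
    response = chunk_text[split_points[1]:]
    return context, instruction, response

rng = random.Random(42)  # seeded for reproducibility
text = "riverrun, past Eve and Adam's, from swerve of shore"
context, instruction, response = split_chunk(text, rng)

# The three slices always reassemble the original chunk
assert context + instruction + response == text
```

Note that a cut point at index 0 leaves the context empty, and the last chunk of a text may be shorter than `CHUNK_SIZE`; neither breaks the concatenation invariant, but a decoded chunk shorter than two characters would make `random.sample` raise.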

## Iteration 2 (Fail):

Dataset: same (forgot to save config with new dataset).
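
For reference, the intermediate files produced in Iteration 3 are JSON Lines: one JSON object per line. A stdlib-only sketch of the write/read round trip (the file name and sample record here are hypothetical):

```python
import json
import os
import tempfile

# Hypothetical record in the same shape the dataset-building loop produces
dataset = [
    {
        "conversations": [
            {"role": "user", "content": "### GIVEN THE CONTEXT: riverrun ### INSTRUCTION: past Eve"},
            {"role": "assistant", "content": "and Adam's"},
        ]
    }
]

path = os.path.join(tempfile.gettempdir(), "demo_finnegans_dataset.jsonl")

# Write: one JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for item in dataset:
        json.dump(item, f)
        f.write("\n")

# Read: parse each line independently
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

assert loaded == dataset
```

This is the same format `load_dataset("json", ...)` consumes, which is why the script can reload its own output directly.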