zakariarada/TCLM-beta

Text Generation

large language model

text-generation-inference


zakariarada committed on Oct 5, 2024

Commit 64c7c3c · verified · 1 Parent(s): a11be67

Update README.md

Files changed (1)

README.md +83 -0


@@ -17,7 +17,90 @@ This model was trained using [H2O LLM Studio](https://github.com/h2oai/h2o-llmst
 - Base model: [h2oai/h2o-danube3-500m-chat](https://huggingface.co/h2oai/h2o-danube3-500m-chat)
 - Fine-tuning dataset: [zakariarada/oasst](https://huggingface.co/datasets/zakariarada/oasst)
+## Training
+To fine-tune the model on a custom dataset, follow the example below. It fine-tunes `h2oai/h2o-danube3-500m-chat` with the Hugging Face `transformers` library and expects a Parquet file (`train_full.pq`) with `id`, `parent_id`, `instruction`, and `output` columns.
+### Code Example
+```python
+import pandas as pd
+from transformers import (
+    AutoTokenizer,
+    AutoModelForCausalLM,
+    TrainingArguments,
+    Trainer
+)
+from datasets import Dataset
+# Load Dataset
+data_path = "train_full.pq"
+df = pd.read_parquet(data_path)
+# Prepare Dataset for Training
+dataset = Dataset.from_pandas(df)
+def preprocess_function(examples):
+    # Combine 'instruction' and 'parent_id' as input prompt
+    instruction = examples["instruction"]
+    parent_id = examples["parent_id"]
+    input_prompt = f"{parent_id}: {instruction}" if parent_id else instruction
+    return {
+        "input_text": input_prompt,
+        "target_text": examples["output"]
+    }
+# Preprocess Dataset
+dataset = dataset.map(preprocess_function, remove_columns=["id", "parent_id", "instruction", "output"])
+# Load Tokenizer and Model
+model_name = "h2oai/h2o-danube3-500m-chat"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name)
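+# Tokenize Data
+if tokenizer.pad_token is None:
+    # Padding to max_length needs a pad token; fall back to EOS if none is set
+    tokenizer.pad_token = tokenizer.eos_token
+def tokenize_function(examples):
+    # Join prompt and response so the model learns to generate the response.
+    # The newline-joined format is a simple choice; adapt it to the chat template as needed.
+    texts = [
+        f"{prompt}\n{target}{tokenizer.eos_token}"
+        for prompt, target in zip(examples["input_text"], examples["target_text"])
+    ]
+    tokenized = tokenizer(
+        texts,
+        padding="max_length",
+        truncation=True,
+        max_length=512
+    )
+    # The causal LM loss needs labels; -100 marks padded positions to be ignored
+    tokenized["labels"] = [
+        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
+        for ids, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
+    ]
+    return tokenized
+tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["input_text", "target_text"])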
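+# Hold out part of the data for evaluation (the 10% split is an illustrative choice);
+# per-epoch evaluation and load_best_model_at_end both require an eval dataset
+split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
+# Training Arguments
+training_args = TrainingArguments(
+    output_dir="./output/TCLM-beta/",
+    num_train_epochs=1,
+    per_device_train_batch_size=2,
+    gradient_accumulation_steps=1,
+    eval_strategy="epoch",  # named `evaluation_strategy` in transformers < 4.41
+    save_strategy="epoch",
+    learning_rate=1e-4,
+    weight_decay=0.0,
+    lr_scheduler_type="cosine",
+    warmup_ratio=0.0,
+    logging_dir="./logs",
+    logging_steps=10,
+    fp16=True,  # mixed precision; requires a CUDA GPU
+    save_total_limit=1,
+    load_best_model_at_end=True,
+    metric_for_best_model="loss",
+    greater_is_better=False
+)
+# Trainer Setup
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=split_dataset["train"],
+    eval_dataset=split_dataset["test"],
+    tokenizer=tokenizer,
+)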
+# Train Model
+trainer.train()
+```
 ## Usage
 To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` library installed.
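
Review note: two details of the added training example are easier to check in isolation. The preprocessing step prefixes the instruction with `parent_id` when one is present, and for causal language model fine-tuning `Trainer` computes a loss only when each example carries a `labels` field, usually a copy of `input_ids` with padded positions set to -100. The following is a minimal pure-Python sketch of both steps. The helper names `build_prompt` and `mask_padding_labels` and the toy token ids are hypothetical and not part of the README or of `transformers`.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_prompt(instruction: str, parent_id: Optional[str] = None) -> str:
    """Mirror the README's preprocessing: prefix the instruction with parent_id if set."""
    return f"{parent_id}: {instruction}" if parent_id else instruction


def mask_padding_labels(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Toy values standing in for real tokenizer output (the ids are made up)
input_ids = [5, 17, 42, 0, 0]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)

print(build_prompt("Summarize the text.", "msg-1"))  # msg-1: Summarize the text.
print(labels)  # [5, 17, 42, -100, -100]
```

Masking by `attention_mask` rather than by comparing against the pad token id keeps the sketch correct even when the pad token is the same as the end-of-sequence token.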