zakariarada committed on
Commit 64c7c3c
1 Parent(s): a11be67

Update README.md

Files changed (1)
  1. README.md +83 -0
README.md CHANGED
@@ -17,7 +17,90 @@ This model was trained using [H2O LLM Studio](https://github.com/h2oai/h2o-llmst
  - Base model: [h2oai/h2o-danube3-500m-chat](https://huggingface.co/h2oai/h2o-danube3-500m-chat)
  - Fine-tuning dataset: [zakariarada/oasst](https://huggingface.co/datasets/zakariarada/oasst)

+ ## Training

+ To train the model using your custom dataset, you can follow the steps below. This example demonstrates how to fine-tune the `h2oai/h2o-danube3-500m-chat` model using the Hugging Face `transformers` library.
+
+ ### Code Example
+
+ ```python
+ import pandas as pd
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForCausalLM,
+     TrainingArguments,
+     Trainer,
+     DataCollatorForLanguageModeling,
+ )
+ from datasets import Dataset
+
+ # Load Dataset (Parquet export of the fine-tuning data)
+ data_path = "train_full.pq"
+ df = pd.read_parquet(data_path)
+
+ # Prepare Dataset for Training
+ dataset = Dataset.from_pandas(df)
+
+ def preprocess_function(examples):
+     # Combine 'parent_id' and 'instruction' into the input prompt
+     instruction = examples["instruction"]
+     parent_id = examples["parent_id"]
+     input_prompt = f"{parent_id}: {instruction}" if parent_id else instruction
+     return {
+         "input_text": input_prompt,
+         "target_text": examples["output"]
+     }
+
+ # Preprocess Dataset: build prompt/response pairs and drop the raw columns
+ dataset = dataset.map(preprocess_function, remove_columns=["id", "parent_id", "instruction", "output"])
+
+ # Load Tokenizer and Model
+ model_name = "h2oai/h2o-danube3-500m-chat"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Tokenize Data: prompt and response are concatenated so the causal LM trains on the full sequence
+ def tokenize_function(examples):
+     texts = [
+         f"{prompt}\n{response}{tokenizer.eos_token}"
+         for prompt, response in zip(examples["input_text"], examples["target_text"])
+     ]
+     return tokenizer(
+         texts,
+         padding="max_length",
+         truncation=True,
+         max_length=512
+     )
+
+ tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["input_text", "target_text"])
+
+ # Hold out a small validation split so `load_best_model_at_end` has an eval loss to compare
+ split_dataset = tokenized_dataset.train_test_split(test_size=0.05, seed=42)
+
+ # The collator copies input_ids into labels and masks padding tokens with -100
+ data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
+
+ # Training Arguments
+ training_args = TrainingArguments(
+     output_dir="./output/TCLM-beta/",
+     num_train_epochs=1,
+     per_device_train_batch_size=2,
+     gradient_accumulation_steps=1,
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+     learning_rate=1e-4,
+     weight_decay=0.0,
+     lr_scheduler_type="cosine",
+     warmup_ratio=0.0,
+     logging_dir="./logs",
+     logging_steps=10,
+     fp16=True,
+     save_total_limit=1,
+     load_best_model_at_end=True,
+     metric_for_best_model="loss",
+     greater_is_better=False
+ )
+
+ # Trainer Setup
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=split_dataset["train"],
+     eval_dataset=split_dataset["test"],
+     data_collator=data_collator,
+     tokenizer=tokenizer,
+ )
+
+ # Train Model
+ trainer.train()
+ ```
  ## Usage

  To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` library installed.
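
Once the library is installed and the fine-tuned weights are available, a minimal inference sketch might look like the following. The checkpoint path (reusing the `output_dir` from the training example), the prompt, and the generation settings are illustrative assumptions, not values from this repository:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed location of the fine-tuned weights; replace with your checkpoint path or Hub repo id
model_path = "./output/TCLM-beta/"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

# Illustrative prompt
prompt = "Why is drinking water so healthy?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate a short completion from the fine-tuned model
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```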