kmagai/llm-jp-3-13b-finetune-2

@@ -11,12 +11,133 @@ language:
 - en
 ---
-# Uploaded  model
+# Uploaded model
 - **Developed by:** kmagai
 - **License:** apache-2.0
-- **Finetuned from model :** llm-jp/llm-jp-3-13b
+- **Finetuned from model:** llm-jp/llm-jp-3-13b
 This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
 [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+## JSONL Output Process
+### Model Inference Setup
+```python
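+import json
+import os
+
+import torch
+from tqdm import tqdm
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+# Assumed placeholders: the original snippet does not define these two names.
+model_name = "kmagai/llm-jp-3-13b-finetune-2"  # assumption: this repository's model ID
+HF_TOKEN = os.environ.get("HF_TOKEN")  # assumption: token supplied via an environment variable
+
+# 4-bit NF4 quantization config (the quantization scheme used by QLoRA)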
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_use_double_quant=False,
+)
+# Load model and tokenizer
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    quantization_config=bnb_config,
+    device_map="auto",
+    token=HF_TOKEN
+)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, token=HF_TOKEN)
+```
+### Input Data Processing
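+The script reads input data from a JSONL file (`elyza-tasks-100-TV_0.jsonl`). Each record is a JSON object with task information such as `task_id` and `input`. The loop accumulates lines until the buffer ends with `}`, so it also accepts objects that span several lines: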
+```python
+datasets = []
+with open("./elyza-tasks-100-TV_0.jsonl", "r", encoding="utf-8") as f:
+    item = ""
+    for line in f:
+        line = line.strip()
+        item += line
+        if item.endswith("}"):
+            datasets.append(json.loads(item))
+            item = ""
+```
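The reading loop above can be wrapped in a reusable function. The following is a minimal sketch of the same logic; `read_jsonl` is a hypothetical helper name that does not appear in the original script, and the `encoding="utf-8"` argument assumes the task file is UTF-8 encoded.

```python
import json


def read_jsonl(path):
    """Read JSON objects from a file, tolerating objects that span several lines.

    Hypothetical helper mirroring the loop above; assumes UTF-8 input.
    """
    records = []
    buffer = ""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            buffer += line.strip()
            # A buffer that ends with "}" is treated as one complete object.
            if buffer.endswith("}"):
                records.append(json.loads(buffer))
                buffer = ""
    return records
```

One caveat of this heuristic: if a nested object closes at the end of a line before the outer object is complete, `json.loads` is called on an incomplete buffer and raises `json.JSONDecodeError`. For flat records like the task file's, this should not come up.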
+### Generation Process
+For each record in the dataset, the script:
+1. Formats the prompt with the instruction template
+2. Tokenizes the prompt
+3. Generates a response with the model
+4. Decodes only the newly generated tokens
+5. Appends a result object containing `task_id`, `input`, and `output`
+```python
+results = []
+for data in tqdm(datasets):
+    input_text = data["input"]  # renamed to avoid shadowing the built-in input()
+    # Note: the indentation inside the triple-quoted string is part of the prompt.
+    prompt = f"""### Instruction
+    {input_text}
+    ### Response:
+    """
+    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
+    with torch.no_grad():
+        outputs = model.generate(
+            tokenized_input,
+            max_new_tokens=100,
+            do_sample=False,
+            repetition_penalty=1.2
+        )[0]
+    # generate() returns prompt + continuation; slice off the prompt tokens before decoding.
+    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)
+    results.append({"task_id": data["task_id"], "input": input_text, "output": output})
+```
+### Generation Parameters
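+- `max_new_tokens=100`: Maximum number of tokens to generate, not counting the prompt
+- `do_sample=False`: Greedy decoding, so the same prompt and model give the same output
+- `repetition_penalty=1.2`: Values above 1.0 lower the scores of tokens that already appear in the sequence, which discourages repetition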
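To make the interaction between greedy decoding and the repetition penalty concrete, here is a simplified pure-Python sketch. It assumes the commonly used rule in which the score of an already-seen token is divided by the penalty when positive and multiplied by it when negative; the real library operates on tensors, and the function names and logit values below are illustrative only.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated token IDs penalized."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Negative scores become more negative, other scores shrink,
        # so a repeated token becomes less likely either way.
        penalized[token_id] = score * penalty if score < 0 else score / penalty
    return penalized


def greedy_pick(logits):
    """Greedy decoding (do_sample=False): choose the highest-scoring token ID."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.4, 2.2, -1.0]  # made-up scores for a 3-token vocabulary
print(greedy_pick(logits))                                 # 0
print(greedy_pick(apply_repetition_penalty(logits, [0])))  # 1
```

Without the penalty, token 0 wins again; once token 0 has been generated, its score drops from 2.4 to 2.0 and token 1 is chosen instead.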
+### Output Format
+Each entry in `results` is saved as one line of a JSONL file in the following format:
+```json
+{"task_id": "task_1", "input": "input text", "output": "generated response"}
+```
+Required fields:
+- `task_id`: Unique identifier for the task
+- `output`: Response generated by the model
+Optional fields:
+- `input`: Input text (can be omitted in submission)
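The snippets above build the `results` list but do not show the write step. A minimal sketch of it follows, under the assumption that one JSON object is written per line; `write_jsonl` and its `include_input` flag are hypothetical names, not part of the original script.

```python
import json


def write_jsonl(results, path, include_input=True):
    """Write one JSON object per line; `task_id` and `output` are required."""
    with open(path, "w", encoding="utf-8") as f:
        for record in results:
            missing = {"task_id", "output"} - set(record)
            if missing:
                raise ValueError(f"record is missing required fields: {sorted(missing)}")
            if not include_input:
                # Drop the optional field, e.g. for a submission file.
                record = {k: v for k, v in record.items() if k != "input"}
            # ensure_ascii=False keeps Japanese text readable instead of \uXXXX escapes.
            json.dump(record, f, ensure_ascii=False)
            f.write("\n")
```

A call such as `write_jsonl(results, "./outputs.jsonl")` would then produce the file; the output path here is a placeholder.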
+## Training Data Format
+The training data should be provided in JSONL (JSON Lines) format, where each line is a single JSON object with the following fields (pretty-printed here for readability):
+```json
+{
+    "instruction": "Task instruction text",
+    "input": "Input text (optional)",
+    "output": "Expected output text"
+}
+```
+### Field Descriptions
+- `instruction`: Task instruction that tells the model what to do
+- `input`: (Optional) Input text that provides specific context for the instruction
+- `output`: Expected output that represents the ideal response
+### Example
+```json
+{"instruction": "以下の文章を要約してください。", "input": "人工知能（AI）は、人間の知能を模倣し、学習、推論、判断などを行うコンピュータシステムです。近年、機械学習や深層学習の発展により、画像認識、自然言語処理、ゲームなど様々な分野で人間に匹敵する、あるいは人間を超える性能を示しています。", "output": "AIは人間の知能を模倣するコンピュータシステムで、機械学習の発展により多くの分野で高い性能を示している。"}
+{"instruction": "次の英文を日本語に翻訳してください。", "input": "Artificial Intelligence is transforming the way we live and work.", "output": "人工知能は私たちの生活と仕事の仕方を変革しています。"}
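Before training, it can help to check each line against the field descriptions above. The sketch below is one possible check, assuming `instruction` and `output` must be non-empty strings and a missing `input` can be normalized to an empty string; `parse_training_line` is a hypothetical helper, not part of the original training code.

```python
import json

REQUIRED_FIELDS = ("instruction", "output")


def parse_training_line(line):
    """Parse one JSONL line and check it against the documented fields."""
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value:
            raise ValueError(f"'{field}' must be a non-empty string")
    # `input` is optional; normalize a missing value to an empty string.
    if not isinstance(record.setdefault("input", ""), str):
        raise ValueError("'input' must be a string when present")
    return record


sample = '{"instruction": "Summarize the text.", "output": "A short summary."}'
print(parse_training_line(sample)["input"] == "")  # True
```

Malformed JSON also surfaces as a `ValueError`, since `json.JSONDecodeError` is a subclass of it, so a caller can catch a single exception type per line.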