---
library_name: transformers
license: apache-2.0
datasets:
- nampdn-ai/tiny-codes
- nlpai-lab/openassistant-guanaco-ko
- philschmid/guanaco-sharegpt-style
language:
- ko
- en
inference: false
tags:
- unsloth
- phi-3
pipeline_tag: text-generation
---
# Phi-3-medium-4k-instruct-ko-poc-v0.1
## Model Details
This model was trained with the unsloth toolkit on top of Microsoft's Phi-3-medium-4k-instruct model (https://huggingface.co/unsloth/Phi-3-medium-4k-instruct), with Korean instruction data added to enhance its Korean generation performance.

Since my role is not that of a working developer but of an ML Technical Specialist helping customers with quick PoCs/prototypes, and the Azure GPU resources available to me were limited, I trained on only 40,000 samples on a single Azure Standard_NC24ads_A100_v4 VM for PoC purposes. Because I did not extend the tokenizer, generating Korean text requires far more tokens than generating English.
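The token overhead is easy to check directly with the tokenizer. The sentence pair below is an illustrative assumption; exact counts depend on the text, but the Korean sentence typically splits into several times more tokens than its English counterpart:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/Phi-3-medium-4k-instruct-ko-poc-v0.1")

en = "Machine learning is a field of artificial intelligence."
ko = "머신러닝은 인공지능의 한 분야입니다."  # roughly the same sentence in Korean

# Korean characters mostly fall back to byte-level pieces in the unmodified
# phi-3 vocabulary, so the Korean sentence consumes far more tokens.
print(len(tokenizer(en).input_ids), len(tokenizer(ko).input_ids))
```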
### Dataset
The datasets used for training are listed below. To prevent catastrophic forgetting, I included non-Korean corpora in the training data. Note that I did not use all of the data, but only sampled a portion of it. The Korean textbooks were converted to Q&A format, and the Guanaco datasets were reformatted to fit a multiturn format like `<|user|>\n{Q1}<|end|>\n<|assistant|>\n{A1}<|end|>\n<|user|>\n{Q2}<|end|>\n<|assistant|>\n{A2}<|end|>` (a minimal training sketch follows the list below).
- Korean textbooks (https://huggingface.co/datasets/nampdn-ai/tiny-codes)
- Korean translation of Guanaco (https://huggingface.co/datasets/nlpai-lab/openassistant-guanaco-ko)
- Guanaco Sharegpt style (https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style)
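The exact training script and hyperparameters are not published here, but a fine-tune of this kind typically follows the standard unsloth recipe sketched below. The LoRA rank, batch size, learning rate, and other settings are illustrative assumptions rather than the values actually used, and only one of the three datasets is shown:

```python
# Illustrative fine-tuning sketch with unsloth + TRL; hyperparameters are
# assumptions for demonstration, not the exact settings used for this model.
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-medium-4k-instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Render ShareGPT-style conversations into the phi-3 multiturn format above.
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)

def formatting_prompts_func(examples):
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
             for convo in examples["conversations"]]
    return {"text": texts}

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)

# Attach LoRA adapters; only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # assumed LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        bf16 = True,  # the A100 supports bfloat16
        logging_steps = 10,
        output_dir = "outputs",
    ),
)
trainer.train()
```

With 4-bit loading plus LoRA, a 14B model of this size fits comfortably within the GPU memory of the single-A100 VM mentioned above.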
## How to Get Started with the Model
### Code snippets
```python
### Load model
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

max_seq_length = 2048  # Choose any! unsloth supports RoPE scaling internally.
dtype = None  # None for auto detection. float16 for Tesla T4/V100, bfloat16 for Ampere+.
load_in_4bit = True  # Use 4-bit quantization to reduce memory usage. Can be False.
model_path = "daekeun-ml/Phi-3-medium-4k-instruct-ko-poc-v0.1"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",  # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"},  # ShareGPT style
)

params = {
    "max_new_tokens": 256,
    "use_cache": True,
    "temperature": 0.05,  # low temperature for near-deterministic output
    "do_sample": True,
}

### Inference
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# 1st example
messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence in Korean: 1, 1, 2, 3, 5, 8,"},
    {"from": "assistant", "value": "피보나치 수열의 다음 숫자는 13, 21, 34, 55, 89 등입니다. 각 숫자는 앞의 두 숫자의 합입니다."},
    {"from": "human", "value": "Compute 2x+3=12 in Korean"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, **params)

# 2nd example
messages = [
    {"from": "human", "value": "What is Machine Learning in Korean?"},
    {"from": "assistant", "value": "인공지능의 한 분야로 방대한 데이터를 분석해 향후 패턴을 예측하는 기법입니다."},
    {"from": "human", "value": "What is Deep Learning in Korean?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,  # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, **params)
```
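If unsloth is not available in your environment, the checkpoint can presumably also be loaded with plain transformers. The following is a minimal sketch under that assumption; it builds the prompt manually using the phi-3 multiturn format documented above, and the generation settings mirror `params`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_path = "daekeun-ml/Phi-3-medium-4k-instruct-ko-poc-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype = torch.bfloat16,  # use torch.float16 on pre-Ampere GPUs
    device_map = "auto",
)

# Build the prompt with the phi-3 multiturn format shown in the Dataset section.
prompt = "<|user|>\nWhat is Machine Learning in Korean?<|end|>\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)

streamer = TextStreamer(tokenizer)
_ = model.generate(
    **inputs,
    streamer = streamer,
    max_new_tokens = 256,
    temperature = 0.05,
    do_sample = True,
    use_cache = True,
)
```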
### Inference results
```
# 1st example
<s><|user|> Continue the fibonnaci sequence in Korean: 1, 1, 2, 3, 5, 8,<|end|><|assistant|> 피보나치 수열의 다음 숫자는 13, 21, 34, 55, 89 등입니다. 각 숫자는 앞의 두 숫자의 합입니다.<|end|><|user|> Compute 2x+3=12 in Korean<|end|><|assistant|> 방정식 2x + 3 = 12에서 x를 풀려면 다음 단계를 따르십시오.
1. 방정식의 양쪽에서 3을 빼서 방정식의 한쪽에 대해 x를 분리합니다.
2x + 3 - 3 = 12 - 3
2x = 9
2. 이제 방정식의 양쪽을 2로 나누어 x의 값을 구합니다.
2배 / 2 = 9 / 2
x = 4.5
따라서 방정식 2x + 3 = 12에 대한 해는 x = 4.5입니다.<|end|>

# 2nd example
<s><|user|> What is Machine Learning in Korean?<|end|><|assistant|> 인공지능의 한 분야로 방대한 데이터를 분석해 향후 패턴을 예측하는 기법입니다.<|end|><|user|> What is Deep Learning in Korean?<|end|><|assistant|> 복잡한 데이터 세트를 분석하고 복잡한 패턴을 인식하고 학습하는 데 사용되는 딥러닝의 많은 레이어로 구성된 신경망의 하위 집합입니다. 이 기술은 이미지 인식, 자연어 처리 및 자율 운전과 같은 다양한 응용 분야에서 큰 발전을 이뤘습니다.<|end|>
```
### References
- Base model: [unsloth/Phi-3-medium-4k-instruct](https://huggingface.co/unsloth/Phi-3-medium-4k-instruct)
## Notes
### License
Apache 2.0. Phi-3 itself is released under the MIT license, but I chose Apache 2.0 in consideration of the licenses of the datasets and the library used for training.
### Caution
This model was created as a personal experiment and is unrelated to the organization I work for. The model may not operate correctly, as no separate verification was performed. Please be cautious about using it for anything other than personal experimentation or a PoC (Proof of Concept)!