---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- kyujinpy/KOR-OpenOrca-Platypus-v3
language:
- ko
- en
tags:
- Economic
- Finance
base_model: davidkim205/komt-mistral-7b-v1
---

# Model Details

Model Developers: Sogang University SGEconFinlab (<https://sc.sogang.ac.kr/aifinlab/>)

### Model Description

This model is a language model specialized in economics and finance. It was trained on a variety of economics- and finance-related data.
The data sources are listed below. We are not releasing the training data itself because it was used for research and policy purposes.
If you wish to use the original data, please contact the original authors directly for permission.

- **Developed by:** Sogang University SGEconFinlab (<https://sc.sogang.ac.kr/aifinlab/>)
- **License:** cc-by-nc-4.0
- **Base Model:** davidkim205/komt-mistral-7b-v1 (<https://huggingface.co/davidkim205/komt-mistral-7b-v1>)

## Loading the Model

    import torch
    from peft import PeftConfig, PeftModel
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        TextStreamer,
    )

    peft_model_id = "SGEcon/komt-mistral-7b-v1_fin_v5"
    config = PeftConfig.from_pretrained(peft_model_id)

    # 4-bit NF4 quantization with double quantization; compute in bfloat16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    # Load the quantized base model onto GPU 0, then attach the LoRA adapter
    model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, quantization_config=bnb_config, device_map={"":0})
    model = PeftModel.from_pretrained(model, peft_model_id)
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    model.eval()

    # Print tokens to stdout as they are generated
    streamer = TextStreamer(tokenizer)

## Conducting Conversation

    from transformers import GenerationConfig

    def gen(x):
        # Sampling settings for generation
        generation_config = GenerationConfig(
            temperature=0.8,
            top_p=0.8,
            top_k=100,
            max_new_tokens=1024,
            early_stopping=True,  # only relevant to beam search
            do_sample=True,
        )
        # Wrap the question in the [INST] ... [/INST] instruction template
        q = f"[INST]{x} [/INST]"
        gened = model.generate(
            **tokenizer(
                q,
                return_tensors='pt',
                return_token_type_ids=False
            ).to('cuda'),
            generation_config=generation_config,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            streamer=streamer,
        )
        result_str = tokenizer.decode(gened[0])

        # Remove the input question and the "[INST]" and "[/INST]" tags
        input_question_with_tags = f"[INST]{x} [/INST]"
        result_str = result_str.replace(input_question_with_tags, "").strip()

        # Remove the "<s>" and "</s>" tags
        result_str = result_str.replace("<s>", "").replace("</s>", "").strip()

        return result_str

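
The prompt wrapping and cleanup inside `gen` are plain string operations, so they can be checked without loading the model. The sketch below pulls them into two helpers. The names `build_prompt` and `clean_output` are hypothetical and not part of the original card, and the sample decoded string is made up for illustration.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the [INST] ... [/INST] template used by gen()."""
    return f"[INST]{question} [/INST]"


def clean_output(decoded: str, question: str) -> str:
    """Strip the echoed prompt and the <s>/</s> markers from decoded text."""
    text = decoded.replace(build_prompt(question), "").strip()
    return text.replace("<s>", "").replace("</s>", "").strip()


# Hypothetical decoded string, shaped like the output of tokenizer.decode(gened[0])
decoded = "<s> [INST]What is inflation? [/INST] A sustained rise in prices.</s>"
answer = clean_output(decoded, "What is inflation?")
print(answer)
```
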
## Training Details

We use QLoRA to train the base model.
Quantized Low Rank Adapters (QLoRA) is an efficient fine-tuning technique that keeps the pre-trained language model frozen in 4-bit precision and trains low-rank adapters on top of it, reducing memory usage enough to fine-tune a 65-billion-parameter model on a single 48 GB GPU.
The method combines three components: 4-bit NormalFloat (NF4), a data type that is information-theoretically optimal for normally distributed weights; Double Quantization, which quantizes the quantization constants themselves to reduce the average memory footprint; and Paged Optimizers, which absorb memory spikes during mini-batch processing. Together they increase memory efficiency without sacrificing performance.

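
To make the Double Quantization saving concrete, the sketch below reproduces the per-parameter overhead arithmetic from the QLoRA paper. The block sizes (64 weights per first-level block, 256 constants per second-level block) and the 7-billion parameter count are assumptions taken from the paper's worked example and the base model's nominal size, not settings confirmed for this model.

```python
# Assumed settings from the QLoRA paper's worked example (not confirmed for this model)
BLOCK_SIZE = 64          # weights sharing one quantization constant
SECOND_BLOCK_SIZE = 256  # constants sharing one second-level constant


def overhead_bits_per_param(double_quant: bool) -> float:
    """Extra bits per weight spent on storing quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights
        return 32 / BLOCK_SIZE
    # 8-bit constants, plus one 32-bit constant per block of constants
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)


plain = overhead_bits_per_param(double_quant=False)
double = overhead_bits_per_param(double_quant=True)
saved_bits = plain - double

n_params = 7_000_000_000  # nominal size of a 7B model (assumption)
saved_gb = saved_bits * n_params / 8 / 1e9

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved on a 7B model: about {saved_gb:.2f} GB")
```
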

We also performed instruction tuning using data that we collected together with the kyujinpy/KOR-OpenOrca-Platypus-v3 dataset from Hugging Face.
Instruction tuning is supervised training in which each example pairs an input, consisting of an instruction and any accompanying input data, with the desired output.

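
As an illustration of that pairing, the sketch below turns one instruction-style record into a prompt string and a target string. The field names, the way the instruction and input are joined, and the sample record are all hypothetical. Only the `[INST] ... [/INST]` wrapper comes from the inference code on this card; the actual training template is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Record:
    instruction: str  # what the model is asked to do
    input: str        # optional supporting data (may be empty)
    output: str       # desired response


def to_training_pair(rec: Record) -> tuple[str, str]:
    """Return (prompt, target); the join format is an assumption."""
    question = rec.instruction if not rec.input else f"{rec.instruction}\n{rec.input}"
    return f"[INST]{question} [/INST]", rec.output


# Made-up record for illustration only
rec = Record(
    instruction="Define the term in one sentence.",
    input="base rate",
    output="The base rate is the policy interest rate set by the central bank.",
)
prompt, target = to_training_pair(rec)
print(prompt)
print(target)
```
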
### Training Data

1. Bank of Korea: 700 Economic and Financial Terms (경제금융용어 700선) (<https://www.bok.or.kr/portal/bbs/B0000249/view.do?nttId=235017&menuNo=200765>)
2. Financial Supervisory Service: FINE financial consumer information portal, Dictionary of Financial Terms (금융용어사전) (<https://fine.fss.or.kr/fine/fnctip/fncDicary/list.do?menuNo=900021>)
3. KDI Economic Information Center: Dictionary of Current Affairs Terms (시사 용어사전) (<https://eiec.kdi.re.kr/material/wordDic.do>)
4. The Korea Economic Daily / Hankyung.com: Hankyung Dictionary of Economic Terms (한경경제용어사전) (<https://terms.naver.com/list.naver?cid=42107&categoryId=42107>), Today's TESAT (오늘의 TESAT) (<https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=1>), Today's Junior TESAT (오늘의 주니어 TESAT) (<https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=5>), Saenggeul Saenggeul Hankyung (생글생글한경) (<https://sgsg.hankyung.com/tesat/study>)
5. Ministry of SMEs and Startups / Government of the Republic of Korea: Ministry of SMEs and Startups Technical Terms (중소벤처기업부 전문용어) (<https://terms.naver.com/list.naver?cid=42103&categoryId=42103>)
6. Ko Seong-sam (고성삼) / Beopmun Publishing (법문출판사): Dictionary of Accounting and Tax Terms (회계·세무 용어사전) (<https://terms.naver.com/list.naver?cid=51737&categoryId=51737>)
7. Mankiw's Economics (맨큐의 경제학), 8th edition: Word Index
8. kyujinpy/KOR-OpenOrca-Platypus-v3 (<https://huggingface.co/datasets/kyujinpy/KOR-OpenOrca-Platypus-v3>)

At the request of the original authors, this model is not to be used for commercial purposes. It is therefore released under the CC-BY-NC-4.0 license.
The copyright of the data belongs to the original authors, so please contact them when using it.

### Training Hyperparameters

|Hyperparameter|SGEcon/komt-mistral-7b-v1_fin_v5|
|------|---|
|LoRA method|LoRA|
|load in 4 bit|True|
|learning rate|3e-5|
|lora alpha|8|
|lora rank|32|
|lora dropout|0.05|
|optim|adamw_torch|
|target_modules|o_proj, q_proj, up_proj, down_proj, gate_proj, k_proj, v_proj|

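
As a rough check on what these settings imply, the sketch below counts the trainable LoRA parameters and computes the adapter scaling factor. The layer shapes are assumptions based on the standard Mistral-7B architecture (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 layers); they are not stated on this card.

```python
# Values from the hyperparameter table
LORA_RANK = 32
LORA_ALPHA = 8

# Assumed Mistral-7B shapes: module name -> (in_features, out_features)
NUM_LAYERS = 32
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to each frozen weight."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_RANK  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"scaling (alpha / rank): {scaling}")
```

Under these assumed shapes the adapter has roughly 84 million trainable parameters, and the low-rank update is scaled by 0.25.
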
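### Example

(Prompt and response translated from Korean.)

> Can you explain the role of the central bank?

>> The central bank is an institution that holds the right to issue currency and the authority to control finance. The central bank formulates the nation's monetary, foreign exchange, and financial policies, and at the same time supervises and oversees financial institutions such as commercial banks. The central bank is a lender of funds to the government and commercial banks. Commercial banks borrow funds from the central bank or deposit funds with it. To carry out monetary and credit policy, the central bank lends funds or takes deposits through financial institutions. Along with its role as a lender of funds to commercial banks, the central bank supervises and oversees commercial banks. When a commercial bank lends funds, instead of first paying out the loan to the commercial bank, it takes part or all of the loan as a deposit at the lending bank, lends the money to the central bank, and deposits it. This raises the interest rate on deposits to induce depositors to deposit with the central bank. Meanwhile, when a commercial bank makes a loan, instead of the lending bank depositing the loan amount, it pays the loan amount to the bank receiving the loan.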