---
license: apache-2.0
language:
- ko
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: transformers
datasets:
- nayohan/CodeFeedback-Filtered-Instruction-ko
---
### Model Card for MDDDDR/Llama-3.2-1B-Instruct-FFT-coder-python
- base_model : [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)

### Training dataset
- data_set : [nayohan/CodeFeedback-Filtered-Instruction-ko](https://huggingface.co/datasets/nayohan/CodeFeedback-Filtered-Instruction-ko)
  - Not the entire dataset was used: Python-language examples were extracted first, the structure of those records was examined, and then only the examples that could go through a common preprocessing step were re-extracted and used for training (a rough filtering sketch is shown below).
    - Total training examples: 49,859

### Basic usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'MDDDDR/Llama-3.2-1B-Instruct-FFT-coder-python'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map="cuda:0",
                                             torch_dtype=torch.bfloat16)


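# The instruction below (written in Korean) is the classic LCS problem:
# given two uppercase strings of at most 1,000 characters, print the length
# of their longest common subsequence.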
instruction = '''LCS(Longest Common Subsequence, 최장 공통 부분 수열)문제는 두 수열이 주어졌을 때, 모두의 부분 수열이 되는 수열 중 가장 긴 것을 찾는 문제이다.

예를 들어, ACAYKP와 CAPCAK의 LCS는 ACAK가 된다.

###입력 : 첫째 줄과 둘째 줄에 두 문자열이 주어진다. 문자열은 알파벳 대문자로만 이루어져 있으며, 최대 1000글자로 이루어져 있다.
###출력 : 첫째 줄에 입력으로 주어진 두 문자열의 LCS의 길이를 출력한다.

###입력 예제 : 
ACAYKP
CAPCAK
###출력 예제 : 4
'''

messages = [
    {
        "role":"user",
        "content":"아래는 문제를 설명하는 지시사항입니다. 이 요청에 대해 적절하게 답변해주세요.\n###지시사항:{instruction}\n###답변:".format(instruction=instruction)
    }
]

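# apply_chat_template renders the message with the model's chat format; the
# decoded output is later split on the answer marker ('답변:') to isolate the reply.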
with torch.no_grad():
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
  inputs = tokenizer(prompt, return_tensors="pt", padding=False).to('cuda')
  outputs = model.generate(**inputs,
                           use_cache=False,
                           max_new_tokens=256,   # cap newly generated tokens (max_length would also count the long prompt)
                           do_sample=True,       # required for top_p / temperature to take effect
                           top_p=0.9,
                           temperature=0.7,
                           repetition_penalty=1.0,
                           pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id)

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
final_output = output_text.split('답변:')[-1].strip()
print(final_output)
# ```python
# def longest_common_subsequence(str1, str2):
#     m = len(str1)
#     n = len(str2)
#     dp = [[0] * (n+1) for _ in range(m+1)]
#     
#     for i in range(m+1):
#         for j in range(n+1):
#             if i == 0 or j == 0:
#                 dp[i][j] = 0
#             elif str1[i-1] == str2[j-1]:
#                 dp[i][j] = dp[i-1][j-1] + 1
#             else:
#                 dp[i][j] = max(dp[i-1][j], dp[i][j-1])
#     
#     return dp[m][n]
# 
# print(longest_common_subsequence("ACAYKP", "CAPCAK"))  # Output: 4
# ```
```

### Hardware
- A100 40GB x 1
- Training Time : 1 hour 45 minutes
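
The training script itself is not included in this card. As a rough, hypothetical sketch only, a bf16 full fine-tuning (FFT) run of a 1B model on a single A100 40GB could look like the following with the `transformers` Trainer; every hyperparameter, the toy dataset, and the output path below are assumptions rather than the settings actually used.

```python
# Hypothetical FFT sketch; none of these values are taken from the actual run.
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments, default_data_collator)

base_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Toy stand-in for the real training set; in practice this would be the
# tokenized 49,859-example Python subset described above.
enc = tokenizer(["### instruction ...\n### answer ..."], truncation=True, max_length=512)
train_dataset = Dataset.from_dict({"input_ids": enc["input_ids"],
                                   "attention_mask": enc["attention_mask"],
                                   "labels": enc["input_ids"]})

args = TrainingArguments(
    output_dir="llama-3.2-1b-fft-coder-python",
    per_device_train_batch_size=4,     # assumed; small enough for 40GB in bf16
    gradient_accumulation_steps=4,     # assumed
    num_train_epochs=1,                # assumed
    learning_rate=2e-5,                # assumed
    bf16=True,                         # A100 supports bfloat16 natively
    logging_steps=50,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset,
                  data_collator=default_data_collator)
trainer.train()
```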