---
license: llama3
language:
- tr
model-index:
- name: cere-llama-3-8b-tr
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge TR
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc
      value: 44.03
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag TR
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 46.73
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU TR
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 49.11
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA TR
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: acc
      name: accuracy
      value: 48.21
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande TR
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 54.98
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k TR
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 51.78
      name: accuracy
---
# CERE-LLAMA-3-8b-TR

This model is a fine-tuned version of the Llama 3 8B Large Language Model (LLM) for Turkish. It was trained on high-quality Turkish instruction sets built from various open-source and internal resources. The Turkish instruction dataset was carefully annotated so that the model carries out Turkish instructions accurately and in an organized manner.

## Model Details

- **Base Model**: Llama 3 8B based LLM
- **Tokenizer Extension**: Specifically extended for Turkish
- **Training Dataset**: Cleaned Turkish raw data with 5 billion tokens, custom Turkish instruction sets
- **Training Method**: Initially with DoRA, followed by fine-tuning with LoRA
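
To make the two adapter methods named above concrete, here is a minimal NumPy sketch of how LoRA and DoRA each reparameterize a frozen weight matrix. It is an illustration of the general techniques, not this model's training code: the shapes, rank `r`, scaling `alpha`, and the helper names `lora_weight` and `dora_weight` are all hypothetical.

```python
import numpy as np

def lora_weight(w0, a, b, alpha):
    """LoRA: frozen weight plus a scaled low-rank update B @ A."""
    r = a.shape[0]  # adapter rank
    return w0 + (alpha / r) * (b @ a)

def dora_weight(w0, a, b, alpha, m):
    """DoRA: learned per-column magnitude m times the unit-norm direction
    of the LoRA-updated weight."""
    direction = lora_weight(w0, a, b, alpha)
    col_norm = np.linalg.norm(direction, axis=0, keepdims=True)
    return m * direction / col_norm

# Hypothetical toy sizes; real layers are far larger.
rng = np.random.default_rng(0)
d, k, r, alpha = 6, 4, 2, 4.0
w0 = rng.normal(size=(d, k))                   # frozen pretrained weight
a = rng.normal(size=(r, k))                    # trainable, random init
b = np.zeros((d, r))                           # trainable, zero init
m = np.linalg.norm(w0, axis=0, keepdims=True)  # DoRA magnitude init

w_lora = lora_weight(w0, a, b, alpha)
w_dora = dora_weight(w0, a, b, alpha, m)
```

With `b` initialized to zero and `m` initialized to the column norms of `w0`, both adapters reproduce the pretrained weight exactly before any training step, which is why either can be attached without changing the base model's initial behavior.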

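## Open LLM Turkish Leaderboard v0.2 Evaluation Results

| Metric | Value |
|---|---|
| Avg. | 49.14 |
| AI2 Reasoning Challenge_tr | 44.03 |
| HellaSwag_tr | 46.73 |
| MMLU_tr | 49.11 |
| TruthfulQA_tr | 48.21 |
| Winogrande_tr | 54.98 |
| GSM8k_tr | 51.78 |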
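
As a quick check on the average, the snippet below takes the six accuracy values from the `model-index` metadata at the top of this card and computes their arithmetic mean. It assumes the leaderboard average is an unweighted mean over the six tasks; the `scores` dictionary is just an illustrative container.

```python
# Accuracy values copied from the model-index metadata of this card.
scores = {
    "AI2 Reasoning Challenge_tr": 44.03,
    "HellaSwag_tr": 46.73,
    "MMLU_tr": 49.11,
    "TruthfulQA_tr": 48.21,
    "Winogrande_tr": 54.98,
    "GSM8k_tr": 51.78,
}

# Unweighted mean over all tasks (assumed averaging scheme).
average = round(sum(scores.values()) / len(scores), 2)
print(f"Avg.: {average}")  # Avg.: 49.14
```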

## Usage Examples

```python
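from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cerebrum/cere-llama-3-8b-tr"

# device_map="auto" decides where the weights go; torch_dtype="auto" keeps the checkpoint dtype
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# English: "How do you print 'Hello World' to the screen in Python?"
prompt = "Python'da ekrana 'Merhaba Dünya' nasıl yazılır?"
messages = [
    # English: "You are a helpful AI produced by Cerebrum Tech that tries to generate
    # the best answer by following the given instructions."
    {"role": "system", "content": "Sen, Cerebrum Tech tarafından üretilen ve verilen talimatları takip ederek en iyi cevabı üretmeye çalışan yardımcı bir yapay zekasın."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and open the assistant turn
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Send the inputs to the device the model was actually placed on
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes input_ids together with attention_mask
    do_sample=True,  # required for temperature, top_k and top_p to take effect
    temperature=0.3,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
    repetition_penalty=1.0,
)
# Keep only the newly generated tokens by dropping the prompt prefix
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)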
```
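
The list comprehension near the end of the example is easy to misread, so here is the same slicing step in isolation on plain Python lists. `generate` returns the prompt tokens followed by the new tokens, and slicing each output by the length of its input leaves only the completion. The token IDs and the helper name `strip_prompt_tokens` are made up for illustration.

```python
def strip_prompt_tokens(input_ids, generated_ids):
    """Remove the echoed prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]

# Made-up token IDs: a 4-token prompt followed by 3 newly generated tokens.
input_ids = [[101, 7, 8, 9]]
generated_ids = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt_tokens(input_ids, generated_ids)
print(new_tokens)  # [[42, 43, 2]]
```

Without this step, decoding would return the system and user messages along with the answer.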