Commit 3bc0750, committed by Morfoz-Aigap (1 parent: 5cf94cc)
Update README.md

README.md CHANGED
@@ -2,5 +2,69 @@
license: apache-2.0
language:
- tr
---

# Morfoz-LLM-8b-v1.0

This model is a Turkish-extended Large Language Model (LLM) based on Llama-3 8B Instruct. It was trained on a cleaned raw Turkish dataset. For fine-tuning, we used the LoRA method with Turkish instruction sets compiled from various open-source sources.

## Model Details

- **Base Model**: Meta Llama-3 8B Instruct
- **Tokenizer Extension**: Specifically extended for Turkish
- **Training Dataset**: Cleaned Turkish raw data with custom Turkish instruction sets
- **Training Method**: Fine-tuning with LoRA

### LoRA Fine-Tuning Configuration

- `lora_alpha`: 16
- `lora_dropout`: 0.05
- `r`: 64
- `target_modules`: "all-linear"
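
To give a feel for what these settings imply, here is a minimal back-of-the-envelope sketch in plain Python. It assumes the standard LoRA scaling factor `lora_alpha / r`, the layer shapes of the public Llama-3 8B configuration (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers), and that `"all-linear"` covers the seven linear projections in each decoder layer but not the output head. None of these shapes are stated in this README, and `lora_params` is a hypothetical helper name.

```python
# LoRA hyperparameters from the configuration above.
lora_alpha = 16
r = 64

# Standard LoRA scales the low-rank update (B @ A) by lora_alpha / r.
scaling = lora_alpha / r

# ASSUMPTION: shapes taken from the public Llama-3 8B config, not from this README.
hidden, kv_dim, intermediate, num_layers = 4096, 1024, 14336, 32

# ASSUMPTION: "all-linear" targets these per-layer projections (output head excluded).
linear_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(d_in, d_out, rank):
    """Adapter parameters for one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in linear_shapes.values())
total = per_layer * num_layers

print(f"scaling = {scaling}")
print(f"adapter parameters per layer = {per_layer:,}")
print(f"adapter parameters in total = {total:,}")
```

Under these assumptions the adapters add roughly 168M trainable parameters, a small fraction of the 8B base model.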

## Usage Examples

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Morfoz-Aigap/Morfoz-LLM-8b-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in bfloat16 and place the whole model on GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    low_cpu_mem_usage=True,
)

messages = [
    # "Can you write a short children's story called Little Red Riding Hood?"
    {"role": "user", "content": "Kırmızı başlıklı kız adında kısa bir çocuk hikayesi yazabilir misin?"}
]

# Sampling parameters
top_k = 50
top_p = 0.9
temperature = 0.6


def get_formatted_input(messages):
    """Render the chat as "User: ..." / "Assistant: ..." turns, ending with an open assistant turn."""
    conversation = "\n\n".join(
        ("User: " if item["role"] == "user" else "Assistant: ") + item["content"]
        for item in messages
    ) + "\n\nAssistant:"
    return "\n\n" + conversation


formatted_input = get_formatted_input(messages)
print(formatted_input)

# The BOS token is prepended to the prompt text before tokenization.
tokenized_prompt = tokenizer(tokenizer.bos_token + formatted_input, return_tensors="pt").to(model.device)

# Stop on either the regular EOS token or the Llama-3 end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids=tokenized_prompt.input_ids,
    attention_mask=tokenized_prompt.attention_mask,
    do_sample=True,
    max_new_tokens=256,
    eos_token_id=terminators,
    top_k=top_k,
    top_p=top_p,
    temperature=temperature,
)

# Keep only the newly generated tokens (drop the prompt).
response = outputs[0][tokenized_prompt.input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
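
The prompt layout used in the usage example can be inspected without loading the model. The sketch below reproduces the same "User:" / "Assistant:" formatting in plain Python and applies it to a short multi-turn history. `format_conversation` is a hypothetical helper name, the Turkish messages are made-up sample data, and whether the model handles multi-turn prompts well is an assumption that this README does not confirm.

```python
def format_conversation(messages):
    """Hypothetical helper mirroring the "User:" / "Assistant:" layout of the usage example."""
    turns = [
        ("User: " if m["role"] == "user" else "Assistant: ") + m["content"]
        for m in messages
    ]
    # Leading blank line, turns separated by blank lines, open assistant turn at the end.
    return "\n\n" + "\n\n".join(turns) + "\n\nAssistant:"


# Made-up sample history (illustrative only).
history = [
    {"role": "user", "content": "Merhaba!"},  # "Hello!"
    {"role": "assistant", "content": "Merhaba, size nasıl yardımcı olabilirim?"},  # "Hello, how can I help you?"
    {"role": "user", "content": "Bana kısa bir masal anlat."},  # "Tell me a short fairy tale."
]

prompt = format_conversation(history)
print(prompt)
```

The resulting string ends with an open `Assistant:` turn, which is where generation would continue.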