MKLLM-7B-Instruct
MKLLM-7B is an open-source Large Language Model for the Macedonian language. The model is built on top of the amazing Mistral-7B-v0.1 model by continued pretraining on a mix of Macedonian and English text. A corpus of around 300M tokens, repeated for 2 epochs, was used for training. Although this might be considered small compared to other similar projects, the resulting model is very capable of understanding and processing the Macedonian language.
This is the instruction-tuned version of MKLLM-7B. It was trained by taking MKLLM-7B and performing full instruction training with axolotl, using the chatml format for conversations.
We tested the model against Meta's Llama3-8B-Instruct and Mistral's Mistral-7B-Instruct-v0.3 on a set of benchmarks we translated into Macedonian, and the model performs better than both leading models in its category. These benchmarks focus primarily on understanding and do not measure generation capabilities or fluency. In those categories we believe there is an even larger difference in performance, as MKLLM-7B-Instruct writes much more coherent Macedonian. The benchmarking was done with: https://github.com/N13T/mk-llm-eval
To leverage the instruction training, your prompt should follow the chatml format:
<|im_start|>system
Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот.<|im_end|>
<|im_start|>user
Која планета е позната како 'Црвената Планета'?<|im_end|>
<|im_start|>assistant
Марс<|im_end|>
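As a rough illustration of the layout above, the following sketch assembles a chatml string by hand from a list of messages. The helper name `build_chatml_prompt` is hypothetical, and the exact newline placement is an assumption based on the example shown here; the chat template bundled with the tokenizer remains the authoritative definition of the format.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a chatml string.

    Hypothetical helper for illustration only; whitespace details are assumed.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција."},
    {"role": "user", "content": "Која планета е позната како 'Црвената Планета'?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

Leaving the assistant turn open at the end is what prompts the model to generate the answer rather than another user message.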
The same format is available as a chat template, which means you can format messages using the tokenizer.apply_chat_template() method:
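import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: placeholder id and bfloat16 loading; use the model's full Hub repo id or a local path.
model_id = "MKLLM-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [
    {"role": "system", "content": "Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот."},
    {"role": "user", "content": "Која планета е позната како 'Црвената Планета'?"},
]
# Apply the chatml template and tokenize in one step.
gen_input = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to("cuda")

with torch.no_grad():
    generated_ids = model.generate(
        **gen_input,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.1,
        repetition_penalty=1.1,
    )
# Decode only the newly generated tokens by slicing off the prompt.
print(tokenizer.decode(generated_ids[0][gen_input["input_ids"].shape[1]:], skip_special_tokens=False))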
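Because the decode call above uses skip_special_tokens=False, the decoded text will typically still end with the `<|im_end|>` marker. A minimal sketch of trimming it, assuming the reply ends at the first `<|im_end|>` (the helper name `extract_reply` is hypothetical):

```python
def extract_reply(decoded, end_token="<|im_end|>"):
    """Return the text before the first end-of-turn marker, without surrounding whitespace."""
    # partition() returns the whole string as the first element if the marker is absent.
    reply, _, _ = decoded.partition(end_token)
    return reply.strip()


decoded = "Марс<|im_end|>"
print(extract_reply(decoded))
```

Passing skip_special_tokens=True to tokenizer.decode() is an alternative when the markers are not needed at all.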
Notes
- MKLLM-7B-Instruct can hallucinate and produce factually incorrect output. This is especially pronounced when discussing Macedonian topics due to the smaller training dataset.