---
library_name: transformers
license: apache-2.0
base_model: Hemanth-thunder/Tamil-Mistral-7B-v0.1
Pretrain_Model: mistralai/Mistral-7B-v0.1
tags:
- Mistral
- instruct
- finetune
- chatml
- DPO
- RLHF
- gpt4
- synthetic data
- distillation
- function calling
- json mode
datasets:
- Hemanth-thunder/tamil-open-instruct-v1
language:
- ta
widget:
- example_title: Tamil Chat with LLM
  messages:
  - role: system
    content: >-
      சரியான பதிலுடன் வேலையை வெற்றிகரமாக முடிக்க, வழங்கப்பட்ட வழிகாட்டுதல்களைப்
      பின்பற்றி, தேவையான தகவலை உள்ளிடவும்.
  - role: user
    content: மூன்று இயற்கை கூறுகளை குறிப்பிடவும்.
---

# Model Card for Tamil-Mistral-7B-Instruct-v0.1

The Tamil-Mistral-7B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of [Tamil-Mistral-7B-v0.1](https://huggingface.co/Hemanth-thunder/Tamil-Mistral-7B-v0.1).

Unlike the English-centric Mistral model, which does not engage effectively with Tamil, this model is specifically tailored to comprehend and generate Tamil text. Through extensive instruction fine-tuning of the Tamil Mistral base model, it has been trained to grasp the nuances of the language and to interact naturally through text, bringing conversational language processing to Tamil.

# Dataset 
The model was fine-tuned on the [tamil-open-instruct-v1](https://huggingface.co/datasets/Hemanth-thunder/tamil-open-instruct-v1) dataset: roughly 400k instructions translated to Tamil with Google Translate.
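To inspect the data yourself, a minimal sketch using the `datasets` library (the `train` split name is an assumption; check the dataset page for the actual splits and columns):

```python
# A minimal sketch for browsing the fine-tuning data.
# Assumes the default "train" split exists; see the dataset page for the schema.
from datasets import load_dataset

ds = load_dataset("Hemanth-thunder/tamil-open-instruct-v1", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one translated instruction example
```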

# Training time
Training took about 18 hours on an NVIDIA RTX A6000 (48 GB) with a batch size of 30.
 
## Kaggle demo link
[Tamil-Mistral-Instruct-v0.1 demo on Kaggle](https://www.kaggle.com/code/hemanthkumar21/tamil-mistral-instruct-v0-1-demo/)

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TextStreamer, pipeline)

model_name = "Hemanth-thunder/Tamil-Mistral-7B-Instruct-v0.1"

# 4-bit NF4 quantization so the 7B model fits in modest GPU memory.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", quantization_config=nf4_config,
    use_cache=False, low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

streamer = TextStreamer(tokenizer)  # stream tokens to stdout during generation
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                do_sample=True, repetition_penalty=1.15, top_p=0.95, streamer=streamer)

# create_prompt is defined in the "Python function to format query" section below.
prompt = create_prompt("வாழ்க்கையில் ஆரோக்கியமாக இருப்பது எப்படி?")  # "How to stay healthy in life?"
result = pipe(prompt, max_length=512, pad_token_id=tokenizer.eos_token_id)
```
```
result:
- உடற்பயிற்சி - ஆரோக்கியமான உணவை உண்ணுங்கள் -2 புகைபிடிக்காதே - தவறாமல் உடற்பயிற்சி செய்</s>
```
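The pipeline returns a list with one dict per prompt; the full text (prompt plus completion) is under the `generated_text` key:

```python
# Extract the generated text from the pipeline output.
print(result[0]["generated_text"])
```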
## Instruction format

To benefit from instruction fine-tuning, your prompt must be wrapped in `<s>` and `</s>` tokens. The format is built around three elements: Instruction, Input, and Response. The Tamil Mistral instruct model engages in conversations based on this structured template, e.g.:


```python
# Without Input. The <s> BOS token is prepended by create_prompt (defined below).
# The Tamil system line reads roughly: "To successfully complete the task with
# the correct answer, enter the required information."
prompt_template = """சரியான பதிலுடன் வேலையை வெற்றிகரமாக முடிக்க, தேவையான தகவலை உள்ளிடவும்.

### Instruction:
{}

### Response:"""

# With Input
prompt_template = """சரியான பதிலுடன் வேலையை வெற்றிகரமாக முடிக்க, வழங்கப்பட்ட வழிகாட்டுதல்களைப் பின்பற்றி, தேவையான தகவலை உள்ளிடவும்.

### Instruction:
{}

### Input:
{}

### Response:"""
```

## Python function to format query
```python
def create_prompt(query, prompt_template=prompt_template):
    """Wrap a Tamil query in the instruction template and BOS token."""
    bos_token = "<s>"
    eos_token = "</s>"  # emitted by the model at the end of its response
    if not query:
        raise ValueError("Please provide a query")
    prompt = bos_token + prompt_template.format(query)
    return prompt
```
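A quick usage check (the query is the widget example from this card; the English gloss is approximate):

```python
prompt = create_prompt("மூன்று இயற்கை கூறுகளை குறிப்பிடவும்.")  # "Name three natural elements."
print(prompt)  # begins with <s> and ends with "### Response:"
```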

## Demo Video
[Demo video](Tamil_llm.mp4)


## Model Architecture
This instruction model is based on Mistral-7B-v0.1, a transformer model with the following architecture choices:
- Grouped-Query Attention
- Sliding-Window Attention
- Byte-fallback BPE tokenizer
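
These choices can be read directly from the model's configuration; a minimal sketch using the config fields `transformers` exposes for Mistral models:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Hemanth-thunder/Tamil-Mistral-7B-Instruct-v0.1")
print(config.sliding_window)       # sliding-window attention width
print(config.num_attention_heads)  # number of query heads
print(config.num_key_value_heads)  # fewer KV heads than query heads => grouped-query attention
```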

## Troubleshooting
- If you see the following error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/transformers/models/auto/auto_factory.py", line 482, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/transformers/models/auto/configuration_auto.py", line 1022, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/transformers/models/auto/configuration_auto.py", line 723, in __getitem__
    raise KeyError(key)
KeyError: 'mistral'
```

installing `transformers` from source should solve the issue:
```
pip install git+https://github.com/huggingface/transformers
```

This should not be required after transformers-v4.33.4.
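To check which version you have installed (a quick sanity check, not specific to this model):

```python
import transformers
print(transformers.__version__)  # Mistral support requires a recent enough version
```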

## Limitations

The Tamil-Mistral-7B-Instruct model is a quick demonstration that the base model can easily be fine-tuned to achieve compelling performance.
It does not have any moderation mechanisms. We look forward to engaging with the community on ways to
make the model respect guardrails, allowing for deployment in environments requiring moderated outputs.


## Quantized Versions:
Coming soon.

# How to Cite

```bibtex
@misc{Tamil-Mistral-7B-Instruct-v0.1,
      url={https://huggingface.co/Hemanth-thunder/Tamil-Mistral-7B-Instruct-v0.1},
      title={Tamil-Mistral-7B-Instruct-v0.1},
      author={Hemanth Kumar}
}
```