---
license: mit

---


This model is part of two model series, AryaBhatta-1 and AryaBhatta-2. It is fine-tuned from HuggingFaceH4/zephyr-7b-gemma-v0.1 or Google/gemma on 9 Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) plus English.

There are two models: one fine-tuned on Google's Gemma and one fine-tuned on Zephyr's Gemma base. The repo for the other (Zephyr-based) model is GenVRadmin/AryaBhatta-GemmaOrca-2-Merged.

To improve reasoning and maths skills, we first fine-tune Gemma with SFT on Microsoft's Orca datasets.

We use the Orca maths Hindi dataset: GenVRadmin/Aryabhatta-Orca-Maths-Hindi \
and the original Orca maths dataset: microsoft/orca-math-word-problems-200k

This pushes the MATHS score from 24.3 in Gemma-7B to 25.5 in Zephyr-Gemma and 31.6 in GemmaOrca.

The model is then fine-tuned on GenVR's Samvaad datasets (GenVRadmin/Samvaad-Indic-Positive, GenVRadmin/Samvaad-Tamil-Mixtral, and a subset of GenVRadmin/Samvaad-Mixed-Language-3).

It is then fine-tuned on various open-source datasets, including:

Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized \
Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized \
abhinand/tamil-alpaca \
Tensoic/airoboros-3.2_kn \
Tensoic/gpt-teacher_kn \
Tensoic/Alpaca-Gujarati \
HydraIndicLM/bengali_alpaca_dolly_67k \
Open-Orca/OpenOrca \
pankajmathur/alpaca_orca \
OdiaGenAI/Odia_Alpaca_instructions_52k \
OdiaGenAI/gpt-teacher-roleplay-odia-3k \
GenVRadmin/Samvaad-Punjabi-Mini \
pankajmathur/WizardLM_Orca 

The model achieves the following scores on benchmarks:

| Model | AGIEval | GPT4All | TruthfulQA | BigBench | Average ⬇️ |
|---|---|---|---|---|---|
| AryaBhatta-GemmaOrca | 35.9 | 72.26 | 53.85 | 40.35 | 50.59 |
| zephyr-7b-beta | 37.52 | 71.77 | 55.26 | 39.77 | 51.08 |
| zephyr-7b-gemma-v0.1 | 34.22 | 66.37 | 52.19 | 37.10 | 47.47 |
| mlabonne/Gemmalpaca-7B | 21.6 | 40.87 | 44.85 | 30.49 | 34.45 |
| google/gemma-7b-it | 21.33 | 40.84 | 41.70 | 30.25 | 33.53 |
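
The Average column above appears to be the plain arithmetic mean of the four benchmark scores. The sketch below recomputes it from the numbers in the table; the `scores` dictionary and the `average` helper are illustrative names, not part of any evaluation tooling.

```python
# Benchmark scores copied from the table above, in the order
# AGIEval, GPT4All, TruthfulQA, BigBench.
scores = {
    "AryaBhatta-GemmaOrca": [35.9, 72.26, 53.85, 40.35],
    "zephyr-7b-beta": [37.52, 71.77, 55.26, 39.77],
    "zephyr-7b-gemma-v0.1": [34.22, 66.37, 52.19, 37.10],
    "mlabonne/Gemmalpaca-7B": [21.6, 40.87, 44.85, 30.49],
    "google/gemma-7b-it": [21.33, 40.84, 41.70, 30.25],
}


def average(values):
    """Arithmetic mean of the benchmark scores, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


# Recompute the Average column for every model.
averages = {name: average(values) for name, values in scores.items()}

for name, value in averages.items():
    print(f"{name}: {value}")
```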





How to use:
```
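import os

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Access token read from the HF_TOKEN environment variable (an assumption; supply it however you prefer).
hf_token = os.environ.get("HF_TOKEN")

model = AutoPeftModelForCausalLM.from_pretrained(
    "GenVRadmin/AryaBhatta-GemmaOrca",
    load_in_4bit = False,
    token = hf_token
)
# Move the model to the GPU so it sits on the same device as the inputs below.
model = model.to("cuda")
tokenizer = AutoTokenizer.from_pretrained("GenVRadmin/AryaBhatta-GemmaOrca")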

input_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

input_text = input_prompt.format(
        "Answer this question about India.", # instruction
        "Who is the Prime Minister of India", # input
        "", # output - leave this blank for generation!
    )

inputs = tokenizer([input_text], return_tensors = "pt").to("cuda")

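# Generate up to 300 new tokens; the decoded string holds the prompt followed by the response.
outputs = model.generate(**inputs, max_new_tokens = 300, use_cache = True)
response = tokenizer.batch_decode(outputs)[0]
print(response)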
```
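
The decoded `response` contains the full prompt followed by the generated text, so the answer usually needs to be cut out. Below is a minimal sketch of one way to do that, assuming the output keeps the `### Response:` marker from the prompt template. The helper `extract_response` and the special-token strings `<bos>`, `<eos>`, and `<pad>` are assumptions for illustration; check the tokenizer's actual special tokens before relying on them.

```python
def extract_response(decoded, marker="### Response:",
                     special_tokens=("<bos>", "<eos>", "<pad>")):
    """Return the text after the first response marker, without special tokens.

    `special_tokens` is an assumed list; adjust it to match the tokenizer.
    """
    _, found, tail = decoded.partition(marker)
    if not found:
        # Marker missing: fall back to cleaning the whole string.
        tail = decoded
    for token in special_tokens:
        tail = tail.replace(token, "")
    return tail.strip()


# Placeholder string standing in for a decoded model output.
sample = (
    "<bos>\n### Instruction:\nAnswer this question about India.\n\n"
    "### Input:\nWho is the Prime Minister of India\n\n"
    "### Response:\nSample answer text.<eos>"
)
answer = extract_response(sample)
print(answer)
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the special tokens, in which case the helper only needs to split on the marker.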