---
license: mit
---
This model is part of a two-model series, AryaBhatta-1 and AryaBhatta-2, finetuned on 9 Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) plus English.
One model is finetuned from Google's Gemma base and the other from HuggingFaceH4/zephyr-7b-gemma-v0.1. The repo for the Zephyr-based model is GenVRadmin/AryaBhatta-GemmaOrca-2-Merged.
To improve the reasoning and maths skills, we first SFT-tune Gemma on Microsoft's Orca datasets.
We utilize the Hindi Orca maths dataset: GenVRadmin/Aryabhatta-Orca-Maths-Hindi \
and the original Orca maths dataset: microsoft/orca-math-word-problems-200k
This pushes the maths score from 24.3 for Gemma-7B to 25.5 for Zephyr-Gemma and 31.6 for GemmaOrca.
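For readers who want to reproduce this stage, a minimal sketch with the trl library might look like the following. The dataset column names ("question"/"answer"), the instruction wording, and all hyperparameters are assumptions for illustration, not the exact recipe used, and trl's API varies across versions:

```python
from datasets import concatenate_datasets, load_dataset
from trl import SFTConfig, SFTTrainer

# Both Orca maths datasets named above; assumes they share a
# question/answer schema (harmonise columns first if they differ).
hindi = load_dataset("GenVRadmin/Aryabhatta-Orca-Maths-Hindi", split="train")
english = load_dataset("microsoft/orca-math-word-problems-200k", split="train")
train = concatenate_datasets([hindi, english]).shuffle(seed=42)

def to_prompt(example):
    # Alpaca-style template matching the inference prompt shown below.
    return (
        "### Instruction:\nSolve this maths word problem.\n"
        f"### Input:\n{example['question']}\n"
        f"### Response:\n{example['answer']}"
    )

trainer = SFTTrainer(
    model="HuggingFaceH4/zephyr-7b-gemma-v0.1",  # Zephyr-Gemma variant
    train_dataset=train,
    formatting_func=to_prompt,
    args=SFTConfig(output_dir="gemma-orca-sft"),
)
trainer.train()
```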
The model is then finetuned on GenVR's Samvaad datasets (GenVRadmin/Samvaad-Indic-Positive, GenVRadmin/Samvaad-Tamil-Mixtral, and a subset of GenVRadmin/Samvaad-Mixed-Language-3).
It is then finetuned on various open-sourced datasets such as the following (a sketch of mixing them appears after the list):
Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized \
Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized \
abhinand/tamil-alpaca \
Tensoic/airoboros-3.2_kn \
Tensoic/gpt-teacher_kn \
Tensoic/Alpaca-Gujarati \
HydraIndicLM/bengali_alpaca_dolly_67k \
Open-Orca/OpenOrca \
pankajmathur/alpaca_orca \
OdiaGenAI/Odia_Alpaca_instructions_52k \
OdiaGenAI/gpt-teacher-roleplay-odia-3k \
GenVRadmin/Samvaad-Punjabi-Mini \
pankajmathur/WizardLM_Orca
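A hedged sketch of how such corpora could be pulled together with the datasets library; the split names and shared schema are assumptions, and in practice each source would need its columns harmonised first:

```python
from datasets import interleave_datasets, load_dataset

# Three of the corpora listed above, as an example; the full mix
# would include every dataset in the list.
sources = [
    "abhinand/tamil-alpaca",
    "Tensoic/Alpaca-Gujarati",
    "OdiaGenAI/Odia_Alpaca_instructions_52k",
]
parts = [load_dataset(name, split="train") for name in sources]

# Interleaving mixes languages within each batch instead of training
# through one language at a time; assumes matching column schemas.
mixed = interleave_datasets(parts, stopping_strategy="all_exhausted")
```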
The model achieves the following scores on benchmarks:
| Model | AGIEval | GPT4All | TruthfulQA | BigBench | Average ⬇️ |
|---|---|---|---|---|---|
| AryaBhatta-GemmaOrca | 35.9 | 72.26 | 53.85 | 40.35 | 50.59 |
| zephyr-7b-beta | 37.52 | 71.77 | 55.26 | 39.77 | 51.08 |
| zephyr-7b-gemma-v0.1 | 34.22 | 66.37 | 52.19 | 37.10 | 47.47 |
| mlabonne/Gemmalpaca-7B | 21.6 | 40.87 | 44.85 | 30.49 | 34.45 |
| google/gemma-7b-it | 21.33 | 40.84 | 41.70 | 30.25 | 33.53 |
How to use:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# hf_token is your Hugging Face access token.
model = AutoPeftModelForCausalLM.from_pretrained(
    "GenVRadmin/AryaBhatta-GemmaOrca",
    load_in_4bit=False,
    token=hf_token,
)
model = model.to("cuda")  # move the model to the same device as the inputs
tokenizer = AutoTokenizer.from_pretrained("GenVRadmin/AryaBhatta-GemmaOrca")

# Alpaca-style prompt template used during finetuning.
input_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""

input_text = input_prompt.format(
    "Answer this question about India.",  # instruction
    "Who is the Prime Minister of India",  # input
    "",  # output - leave this blank for generation!
)

inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]
```
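The decoded string contains the full prompt as well as the generation. A simple way to keep only the model's answer, assuming the Alpaca-style template above, is to split on the response marker:

```python
# Keep only the text generated after the response marker.
answer = response.split("### Response:")[-1].strip()
print(answer)
```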