---
license: mit
---

This model is fine-tuned from HuggingFaceH4/zephyr-7b-gemma-v0.1 on 9 Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) plus English.

To improve reasoning and maths skills, we first SFT-tune the Gemma model on Microsoft's Orca datasets.

We use the Orca maths Hindi dataset: GenVRadmin/Aryabhatta-Orca-Maths-Hindi \
and the original Orca maths dataset: microsoft/orca-math-word-problems-200k

This raises the MATHS score from 24.3 for Gemma-7B to 25.5 for Zephyr-Gemma and 31.6 for GemmaOrca.

The model is then fine-tuned on GenVR's Samvaad datasets (GenVRadmin/Samvaad-Indic-Positive, GenVRadmin/Samvaad-Tamil-Mixtral, and a subset of GenVRadmin/Samvaad-Mixed-Language-3).

Finally, it is fine-tuned on various open-source datasets, including:

Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized \
Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized \
abhinand/tamil-alpaca \
Tensoic/airoboros-3.2_kn \
Tensoic/gpt-teacher_kn \
Tensoic/Alpaca-Gujarati \
HydraIndicLM/bengali_alpaca_dolly_67k \
Open-Orca/OpenOrca \
pankajmathur/alpaca_orca \
OdiaGenAI/Odia_Alpaca_instructions_52k \
OdiaGenAI/gpt-teacher-roleplay-odia-3k \
GenVRadmin/Samvaad-Punjabi-Mini \
pankajmathur/WizardLM_Orca

The model achieves the following scores on benchmarks:

| Model | AGIEval | GPT4All | TruthfulQA | BigBench | Average ⬇️ |
|---|---|---|---|---|---|
| AryaBhatta-GemmaOrca | 35.9 | 72.26 | 53.85 | 40.35 | 50.59 |
| zephyr-7b-beta | 37.52 | 71.77 | 55.26 | 39.77 | 51.08 |
| zephyr-7b-gemma-v0.1 | 34.22 | 66.37 | 52.19 | 37.10 | 47.47 |
| mlabonne/Gemmalpaca-7B | 21.6 | 40.87 | 44.85 | 30.49 | 34.45 |
| google/gemma-7b-it | 21.33 | 40.84 | 41.70 | 30.25 | 33.53 |

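
The Average column in the benchmark scores is consistent with an unweighted mean of the four benchmark scores. The model card does not say how the average was computed, so treat that as an assumption. A minimal sketch that recomputes it from the reported numbers (the names `REPORTED` and `recomputed_averages` are illustrative only):

```python
from statistics import mean

# Scores copied from the benchmark results in this card:
# (AGIEval, GPT4All, TruthfulQA, BigBench), reported average.
REPORTED = {
    "AryaBhatta-GemmaOrca": ((35.9, 72.26, 53.85, 40.35), 50.59),
    "zephyr-7b-beta": ((37.52, 71.77, 55.26, 39.77), 51.08),
    "zephyr-7b-gemma-v0.1": ((34.22, 66.37, 52.19, 37.10), 47.47),
    "mlabonne/Gemmalpaca-7B": ((21.6, 40.87, 44.85, 30.49), 34.45),
    "google/gemma-7b-it": ((21.33, 40.84, 41.70, 30.25), 33.53),
}


def recomputed_averages(rows):
    """Return the unweighted mean of the four scores per model, rounded to 2 decimals."""
    return {name: round(mean(scores), 2) for name, (scores, _) in rows.items()}


if __name__ == "__main__":
    for name, avg in recomputed_averages(REPORTED).items():
        reported = REPORTED[name][1]
        print(f"{name}: reported {reported:.2f}, recomputed {avg:.2f}")
```
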

How to use:

```python
import os

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Hugging Face access token, read from the environment (None if unset).
hf_token = os.environ.get("HF_TOKEN")

# Load the adapter together with its base model, then move it to the GPU
# so that it sits on the same device as the tokenized inputs below.
model = AutoPeftModelForCausalLM.from_pretrained(
    "GenVRadmin/AryaBhatta-GemmaOrca",
    load_in_4bit=False,
    token=hf_token,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("GenVRadmin/AryaBhatta-GemmaOrca", token=hf_token)

# Alpaca-style prompt template: instruction, input, response.
input_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

input_text = input_prompt.format(
    "Answer this question about India.",  # instruction
    "Who is the Prime Minister of India",  # input
    "",  # output - leave this blank for generation!
)

inputs = tokenizer([input_text], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=300, use_cache=True)
# The decoded text contains the prompt followed by the generated response.
response = tokenizer.batch_decode(outputs)[0]
print(response)
```
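
Because the decoded output repeats the prompt, it can help to wrap the template in small helpers. The following is a minimal sketch, not part of the released code: `build_prompt` and `extract_response` are hypothetical helper names, and the default end-of-sequence string `<eos>` is an assumption about the tokenizer (in practice, pass `tokenizer.eos_token`).

```python
PROMPT_TEMPLATE = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Fill the template, leaving the response slot empty for generation."""
    return PROMPT_TEMPLATE.format(instruction, input_text, "")


def extract_response(decoded: str, eos_token: str = "<eos>") -> str:
    """Return the text after the first response marker, without the EOS token.

    If the marker is missing, the whole decoded string is returned (stripped).
    """
    _, found, answer = decoded.partition(RESPONSE_MARKER)
    if not found:
        answer = decoded
    return answer.replace(eos_token, "").strip()
```

With these helpers, `build_prompt("Answer this question about India.", "Who is the Prime Minister of India")` yields the same string as `input_text` in the usage snippet, and `extract_response(response)` keeps only the generated answer.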