---
license: mit
---
This model is finetuned from HuggingFaceH4/zephyr-7b-gemma-v0.1 on 9 Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) plus English. To improve reasoning and maths skills, we first SFT-tune the base Gemma model on Microsoft's Orca datasets.
We utilize the Orca maths Hindi dataset (GenVRadmin/Aryabhatta-Orca-Maths-Hindi) and the original Orca maths dataset (microsoft/orca-math-word-problems-200k).
This pushes the MATHS score from 24.3 in Gemma-7B to 25.5 in Zephyr-Gemma and 31.6 in GemmaOrca.
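A minimal sketch of how this Orca maths SFT mix could be assembled with the Hugging Face `datasets` library. The `question`/`answer` column names are those of microsoft/orca-math-word-problems-200k and are assumed to carry over to the Hindi set; adjust them if the schema differs.

```python
from datasets import load_dataset, concatenate_datasets

# Dataset ids as listed in this card.
orca_hi = load_dataset("GenVRadmin/Aryabhatta-Orca-Maths-Hindi", split="train")
orca_en = load_dataset("microsoft/orca-math-word-problems-200k", split="train")

def to_text(example):
    # Flatten one record into a single prompt/response string for SFT.
    # The "question"/"answer" fields are assumed for both sets.
    return {"text": f"Question: {example['question']}\nAnswer: {example['answer']}"}

# Keep only the "text" column so both sets share one schema before mixing.
orca_hi = orca_hi.map(to_text, remove_columns=orca_hi.column_names)
orca_en = orca_en.map(to_text, remove_columns=orca_en.column_names)

sft_mix = concatenate_datasets([orca_hi, orca_en]).shuffle(seed=42)
```

The resulting `sft_mix` can then be fed to any SFT trainer that consumes a single text field.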
The model is then finetuned on GenVR's Samvaad datasets: GenVRadmin/Samvaad-Indic-Positive, GenVRadmin/Samvaad-Tamil-Mixtral, and a subset of GenVRadmin/Samvaad-Mixed-Language-3.
This is then finetuned on various open-source datasets, including:

- Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized
- Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized
- abhinand/tamil-alpaca
- Tensoic/airoboros-3.2_kn
- Tensoic/gpt-teacher_kn
- Tensoic/Alpaca-Gujarati
- HydraIndicLM/bengali_alpaca_dolly_67k
- Open-Orca/OpenOrca
- pankajmathur/alpaca_orca
- OdiaGenAI/Odia_Alpaca_instructions_52k
- OdiaGenAI/gpt-teacher-roleplay-odia-3k
- GenVRadmin/Samvaad-Punjabi-Mini
- pankajmathur/WizardLM_Orca
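Many of the sets above follow the standard Alpaca instruction/input/output schema. Below is a hedged sketch of normalizing one of them into the same single text field used in the Orca sketch earlier; the field names are assumed, and abhinand/tamil-alpaca is shown purely as an example.

```python
from datasets import load_dataset

# Illustrative only: abhinand/tamil-alpaca is one of the sets listed above;
# the instruction/input/output field names follow the usual Alpaca schema
# and are assumed to apply here.
tamil_alpaca = load_dataset("abhinand/tamil-alpaca", split="train")

def alpaca_to_text(example):
    # Merge instruction, optional input, and output into one training string.
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    return {"text": prompt + "\n" + example["output"]}

tamil_alpaca = tamil_alpaca.map(alpaca_to_text, remove_columns=tamil_alpaca.column_names)
```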
The model achieves the following scores on benchmarks:
| Model | AGIEval | GPT4All | TruthfulQA | BigBench | Average ⬇️ |
|---|---|---|---|---|---|
| AryaBhatta-GemmaOrca | 39.9 | 74.26 | 58.85 | 43.35 | 54.09 |
| zephyr-7b-beta | 37.52 | 71.77 | 55.26 | 39.77 | 51.08 |
| zephyr-7b-gemma-v0.1 | 34.22 | 66.37 | 52.19 | 37.10 | 47.47 |
| mlabonne/Gemmalpaca-7B | 21.6 | 40.87 | 44.85 | 30.49 | 34.45 |
| google/gemma-7b-it | 21.33 | 40.84 | 41.70 | 30.25 | 33.53 |
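A minimal inference sketch with `transformers`. The Hub repo id below is an assumption based on the model name in this card, and the chat template is assumed to be inherited from HuggingFaceH4/zephyr-7b-gemma-v0.1; adjust both if they differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GenVRadmin/AryaBhatta-GemmaOrca"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A Hindi maths word problem: "Three times a number is 27. What is the number?"
messages = [{"role": "user", "content": "एक संख्या का तीन गुना 27 है। वह संख्या क्या है?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```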