metadata

license: apache-2.0
language:
  - hi
  - en
metrics:
  - perplexity
widget:
  - text: BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज
    example_title: Example 1
  - text: 7 अक्टूबर को हमास से जंग शुरू होने के सात महीने बाद इजरायली सेना
    example_title: Example 2
  - text: नेता जी ने 5 जुलाई 1943 को सिंगापुर के टाउन हाल के सामने
    example_title: Example 3
  - text: पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,
    example_title: Example 4
  - text: 13 समन्वित बम विस्फोटों के बाद से मुंबई में कई गैर-राज्य हमले
    example_title: Example 5
  - text: 'गोधरा रेलवे स्टेशन के पास साबरमती ट्रेन के एस-6 कोच में मुस्लिमों '
    example_title: Example 6
  - text: 2007 टूर्नामेंट में क्रिकट विश्व कप के लिए टिकटों से सबसे ज्यादा आमदनी हुई
    example_title: Example 7

Model Card for Ganga-1b! 🌊

The base model Ganga-1b was trained on a monolingual Hindi language dataset as part of Project Unity.
(The first pre-trained Hindi model by any academic research lab in India 🇮🇳!)

Model Details

Model Description 📚

Project Unity is an initiative to address India's linguistic diversity and richness by building a comprehensive resource that covers the country's major languages. Our goal is state-of-the-art performance in understanding and generating text in Indian languages, and to reach it we train models on monolingual data for the regional languages of India. Our first release is the Ganga-1B model, trained on a large dataset of public-domain, web-crawled Hindi data that includes news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality). The dataset was further curated by native Indian speakers to ensure high quality. In our evaluation (see Results), Ganga-1B achieves lower tokenizer fertility and lower perplexity on our dataset than the existing open-source models supporting Indian languages that we compared against, including models with up to 7 billion parameters.

Developed by: Lingo Research Labs at IIT Gandhinagar
Model type: Autoregressive Language Model
Language(s) (NLP): Bilingual (Primary: Hindi [hi], Secondary: English [en])
License: Apache 2.0

How to Get Started with the Model 👨🏻‍💻

Use the code below to get started with the model.

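from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
model = AutoModelForCausalLM.from_pretrained(
    "LingoIITGN/ganga-1b",
    device_map="auto",  # requires the `accelerate` package
)

# Build a text-generation pipeline around the loaded model.
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=5,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.70,
)

# Any Hindi text works as a prompt; this one is Example 1 from the metadata above.
prompt = "BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज"
result = pipe(prompt, pad_token_id=pipe.tokenizer.eos_token_id)
print(result)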

Technical Specifications 🤖

Precision: Float32
Context Length: 2,048
Learning Rate: 4e-4
Optimizer: AdamW
LR Scheduler: Cosine
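
The card gives a learning rate of 4e-4 and a cosine scheduler, but it does not state the warmup length, the total number of training steps, or the final learning rate. As a minimal sketch of what a cosine decay looks like, the function below assumes no warmup, a decay to zero, and a hypothetical `TOTAL_STEPS` of 100,000. None of those three values comes from the model card.

```python
import math

PEAK_LR = 4e-4         # from the model card
TOTAL_STEPS = 100_000  # hypothetical; the card does not state the step count
MIN_LR = 0.0           # assumption; the card does not state the final learning rate


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, min_lr=MIN_LR):
    """Learning rate at `step` under a plain cosine decay with no warmup."""
    # Clamp the step into [0, total_steps] so out-of-range steps stay well defined.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, cosine_lr(step))
```

Under these assumptions the rate starts at the peak, passes through half the peak at the midpoint, and ends at the minimum.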

Model Architecture and Objective

Ganga-1b is a decoder-only transformer model, featuring the following specifications:

Layers: 16
Attention heads: 32
Embedding dimension: 2,048
Vocabulary size: 30,000
Sliding window: 512
Intermediate dimension: 7,168
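
The specifications above are enough for a rough parameter-count check. The sketch below is an estimate under stated assumptions, not the authors' accounting: it assumes full multi-head attention (four square projection matrices per layer), a gated MLP with three projection matrices, and it ignores biases and normalization weights. The card does not say whether the input embedding and the output head share weights, so both cases are computed.

```python
# Values taken from the model card.
LAYERS = 16
HEADS = 32
D_MODEL = 2048
D_FF = 7168
VOCAB = 30_000


def estimate_params(tied_embeddings):
    """Rough weight count; ignores biases and normalization parameters."""
    # Assumption: Q, K, V and output projections are all D_MODEL x D_MODEL.
    attention = 4 * D_MODEL * D_MODEL
    # Assumption: gated MLP with gate, up and down projections.
    mlp = 3 * D_MODEL * D_FF
    embeddings = VOCAB * D_MODEL
    if not tied_embeddings:
        embeddings *= 2  # separate input embedding and output head
    return LAYERS * (attention + mlp) + embeddings


head_dim = D_MODEL // HEADS  # per-head dimension implied by the card's numbers

if __name__ == "__main__":
    print("head dimension:", head_dim)
    print("tied embeddings:", estimate_params(True))
    print("untied embeddings:", estimate_params(False))
```

Under these assumptions the estimate is 1,034,518,528 weights with tied embeddings and 1,095,958,528 without, which is consistent with the "1b" in the model name. The implied per-head dimension is 64.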

Evaluation

[More Information Needed]

Results 🏆

Tokenizers Results

Model	Fertility
Ganga-1b	1.12
Pragna-1b	1.58
Bloom-1b1	1.27
Bloom-1b7	1.27
Gemma-2b	1.89
Bloom-3b	1.27
Airavata-7b	1.69
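
Fertility is commonly defined as the average number of tokens a tokenizer produces per word, so lower values mean words are split into fewer pieces. The card does not describe how the figures above were computed, so the following is only a sketch of that common definition. It assumes words are delimited by whitespace, and `toy_tokenize` is a made-up stand-in for a real tokenizer.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word over `texts`."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in input texts")
    return total_tokens / total_words


def toy_tokenize(text, max_len=4):
    """Illustrative tokenizer only: cuts each word into chunks of at most `max_len` characters."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return pieces


if __name__ == "__main__":
    # "abcd" stays whole, "efghij" is cut into "efgh" and "ij": 3 tokens for 2 words.
    print(fertility(["abcd efghij"], toy_tokenize))
```

With a Hugging Face tokenizer, one could pass `tokenizer.tokenize` as the `tokenize` argument in place of the toy function.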

Metrics

Model	Perplexity (our dataset)	Perplexity (Sangraha dataset)
Ganga-1b	17.92	15.82
Pragna-1b	98.16	9.37
Bloom-1b1	27.81	17.49
Bloom-1b7	22.49	14.28
Gemma-2b	49.27	31.01
Bloom-3b	19.99	12.82
OpenHathi-7B	42.95	25.73
Airavata-7b	60.87	38.24
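
Perplexity is the exponential of the average negative log-likelihood a model assigns to each token, so lower is better. The card does not describe the evaluation protocol (context length, stride, or how scores were normalized across tokenizers), so the sketch below shows only the final formula, applied to a list of natural-log token probabilities.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that gives every token probability 1/4 has perplexity 4
    # (up to floating-point rounding).
    uniform = [math.log(0.25)] * 10
    print(perplexity(uniform))
```

In practice the log-probabilities would come from the model's output logits on held-out text.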

Summary

Bias, Risks, and Limitations 🚨

[More Information Needed]

Recommendations ‼️

This model is a research preview under ongoing iterative updates, and as such it provides only limited safety measures. It may also generate offensive content. Using this model for any illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.

Model Card Contact ✉️

Lingo Research Labs at IIT Gandhinagar, India
Mail at: [email protected]