---
license: apache-2.0
language:
- hi
- en
metrics:
- perplexity
widget:
- text: >-
BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज
example_title: Example 1
- text: >-
7 अक्टूबर को हमास से जंग शुरू होने के सात महीने बाद इजरायली सेना
example_title: Example 2
- text: >-
हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को
example_title: Example 3
- text: >-
पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,
example_title: Example 4
- text: >-
13 समन्वित बम विस्फोटों के बाद से मुंबई में कई गैर-राज्य हमले
example_title: Example 5
- text: >-
निकोला टेस्ला का जन्म 10 जुलाई 1856 को स्किमडज़, क्रोएरिया में हुआ था,
example_title: Example 6
- text: >-
2007 टूर्नामेंट में क्रिकट विश्व कप के लिए टिकटों से सबसे ज्यादा आमदनी हुई
example_title: Example 7
---
# Model Card for Ganga-1b! 🌊
**`Ganga-1b`** is a base model trained on a monolingual **Hindi** dataset as part of ***Project Unity***. We chose the name *Ganga* 🌊 to honor the longest river flowing through the Hindi-speaking region of India 🇮🇳.
**(The first pre-trained Hindi model released by an academic research lab in India 🇮🇳!)**
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667b8f8ba271fc5a8e6929de/jG3tZnGPvH6vcGrvxO-YC.png)
### Model Description 📚
**Project Unity** is an initiative to address **India's linguistic diversity** and richness by creating a comprehensive resource covering the country's major languages. We strive to achieve state-of-the-art performance in understanding and generating text in **Indian languages**.
To achieve this, we train models on India's regional languages in monolingual settings. Our first release is the *Ganga-1b* model, trained on a large corpus of public-domain, web-crawled Hindi text, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality). The dataset was further curated by native Hindi speakers to ensure high quality.
Notably, the **Ganga-1b** model outperforms existing open-source models that support **Indian languages**, including models with up to **7 billion parameters**.
- **Developed by:** [Lingo Research Group at IIT Gandhinagar](https://labs.iitgn.ac.in/lingo/)
- **Model type:** Autoregressive Language Model
- **Language(s) (NLP):** Bilingual (Primary: *Hindi* [**hi**], Secondary: *English* [**en**])
- **License:** Apache 2.0
## How to Get Started with the Model 👨🏻‍💻
Use the code below to get started with the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-1b",
                                             device_map="auto")

input_text = "BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज "
# Move inputs to wherever device_map placed the model, rather than
# hard-coding "cuda" (which fails on CPU-only machines).
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100,
                         do_sample=True, top_k=50,
                         top_p=0.95, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Technical Specifications 🤖
- **Precision**: *Float32*
- **Context Length**: *2,048*
- **Learning Rate**: *4e-4*
- **Optimizer**: *AdamW*
- **LR Scheduler**: *Cosine*
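The cosine schedule above can be sketched in plain Python. The peak learning rate (4e-4) comes from the list; the warmup length and total step count are illustrative assumptions, not published values:

```python
import math

def cosine_lr(step, total_steps, max_lr=4e-4, min_lr=0.0, warmup_steps=0):
    """Cosine learning-rate schedule: decays from max_lr to min_lr.

    max_lr=4e-4 matches the spec above; warmup_steps and total_steps
    are illustrative assumptions.
    """
    if step < warmup_steps:  # optional linear warmup (assumed)
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # starts at the peak LR, 4e-4
print(cosine_lr(1000, 1000))  # decays to min_lr, 0.0
```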
### Model Architecture and Objective
Ganga-1b is a decoder-only transformer model, featuring the following specifications:
* Layers: 16
* Attention heads: 32
* Embedding dimension: 2,048
* Vocabulary size: 30,000
* Sliding window: 512
* Intermediate dimension: 7,168
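A back-of-the-envelope parameter count from these numbers lands close to the "1b" in the model's name. The sketch below assumes a LLaMA-style gated MLP (three projection matrices) and tied input/output embeddings; neither detail is stated in this card:

```python
# Dimensions from the architecture list above.
layers, d_model, vocab, d_ff = 16, 2048, 30_000, 7_168

embeddings = vocab * d_model        # token embedding table
attention  = 4 * d_model * d_model  # Q, K, V, O projections
mlp        = 3 * d_model * d_ff     # gate, up, down (assumed gated MLP)
per_layer  = attention + mlp

total = embeddings + layers * per_layer
print(f"~{total / 1e9:.2f}B parameters")  # roughly 1.03B
```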
## Evaluation
[More Information Needed]
### Results 🏆
<details open>
<summary>Tokenizers Results</summary>
<br>
| Model | Fertility |
|:-----------:|:---------:|
| ***Ganga-1b*** | ***1.12*** |
| Pragna-1b | 1.58 |
| Bloom-1b1 | 1.27 |
| Bloom-1b7 | 1.27 |
| Gemma-2b | 1.89 |
| Bloom-3b | 1.27 |
| Airavata-7b | 1.69 |
</details>
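Fertility here is the average number of subword tokens produced per word; a value near 1 means the tokenizer rarely fragments Hindi words. A minimal sketch of the metric, using a made-up stand-in tokenizer (the numbers in the table come from the real tokenizers):

```python
def fertility(tokenize, texts):
    """Average subword tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def toy_tokenize(text):
    """Stand-in tokenizer: splits words longer than 4 chars in half."""
    out = []
    for w in text.split():
        if len(w) > 4:
            out.extend([w[:len(w) // 2], w[len(w) // 2:]])
        else:
            out.append(w)
    return out

print(fertility(toy_tokenize, ["the tokenizer splits longer words"]))  # 1.8
```

To measure a real tokenizer, pass its encode function and a representative Hindi corpus in place of the toy pieces above.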
<details open>
<summary>Metrics</summary>
<br>
| Model | PPL<sub>Our Dataset</sub> | PPL<sub>Sangraha Dataset</sub> |
|:-----------:|:---------:|:------:|
| ***Ganga-1b*** | ***17.92*** | ***15.82*** |
| Pragna-1b | 98.16 | 9.37 |
| Bloom-1b1 | 27.81 | 17.49 |
| Bloom-1b7 | 22.49 | 14.28 |
| Gemma-2b | 49.27 | 31.01 |
| Bloom-3b | 19.99 | 12.82 |
| OpenHathi-7B | 42.95 | 25.73 |
| Airavata-7b | 60.87 | 38.24 |
</details>
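Perplexity (PPL) is the exponential of the average per-token negative log-likelihood, so lower is better. A minimal stdlib sketch; the per-token probabilities are made up for illustration:

```python
import math

def perplexity(log_probs):
    """exp of the mean negative log-likelihood over tokens."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# Hypothetical per-token natural-log probabilities for one sentence.
log_probs = [math.log(p) for p in [0.25, 0.10, 0.50, 0.05]]
print(round(perplexity(log_probs), 2))  # 6.32
```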
## Bias, Risks, and Limitations 🚨
### Recommendations ‼️
<span style="color:red">This model is a research preview undergoing iterative updates, and as such it provides only limited safety measures. It may also generate offensive content. Using the model for any illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.</span>
## More Information
**DEMO:** [https://huggingface.co/spaces/Lingo-IITGN/ganga-1b](https://huggingface.co/spaces/Lingo-IITGN/ganga-1b)
## Model Card Contact ✉️
[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) <br>
Mail at: [[email protected]](mailto:[email protected])