---
license: apache-2.0
language:
- hi
- en
metrics:
- perplexity
widget:
- text: >-
    BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज के लिए टीम इंडिया का ऐलान कर
    दिया है। इस टीम में कई नए चेहरों को जगह दी गई है।
  example_title: Example 1
- text: >-
    7 अक्टूबर को हमास से जंग शुरू होने के सात महीने बाद इजरायली सेना ने गाजा
    पट्टी में हमास के ठिकानों पर हमला किया था। इस हमले में हमास के कई ठिकानों को
    निशाना
  example_title: Example 2
---
# Model Card for Ganga-1b! 🌊
The base model **``Ganga-1b``** is trained on a monolingual **Hindi** language dataset as part of ***Project Unity***. <br> *(The first pre-trained Hindi model by any academic research lab in India 🇮🇳!)*
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667b8f8ba271fc5a8e6929de/jG3tZnGPvH6vcGrvxO-YC.png)
## Model Details
### Model Description
Project Unity is an initiative that addresses India's linguistic diversity and richness by building a comprehensive resource covering the country's major languages. Our goal is state-of-the-art performance in understanding and generating text in Indian languages. To that end, we train models on monolingual data for the regional languages of India.

Our first release is the Ganga-1B model. It was trained on a large dataset of public-domain, web-crawled Hindi data, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality). The dataset was further curated by native Indian speakers to ensure high quality. In our evaluation, Ganga-1B outperforms existing open-source models that support Indian languages, including models with up to 7 billion parameters.

The model is designed to be compact and efficient, so it can run on edge devices and suits a range of applications that require generating human-like text. Its modest size also makes it easy to integrate into resource-constrained environments, such as personal devices or cloud infrastructure, allowing for wider adoption and innovation in AI-driven technologies.
- **Developed by:** [Lingo Research Labs at IIT Gandhinagar](https://labs.iitgn.ac.in/lingo/)
- **Model type:** Autoregressive Language Model
- **Language(s) (NLP):** Bilingual (Primary: *Hindi* [**hi**], Secondary: *English* [**en**])
- **License:** Apache 2.0
## How to Get Started with the Model
Use the code below to get started with the model.
```python
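from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
model = AutoModelForCausalLM.from_pretrained(
    "LingoIITGN/ganga-1b",
    device_map="auto",  # requires the `accelerate` package
)

# Sampling must be enabled for `temperature` to take effect.
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=5,
    do_sample=True,
    temperature=0.70,
)

# Example Hindi prompt; replace it with your own text.
prompt = "BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज के लिए टीम इंडिया का ऐलान कर दिया है।"
result = pipe(prompt, pad_token_id=pipe.tokenizer.eos_token_id)
print(result)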
```
## Technical Specifications
### Model Architecture and Objective
Ganga-1b is a decoder-only transformer model, featuring the following specifications:
* Layers: 16
* Attention heads: 32
* Embedding dimension: 2,048
* Vocabulary size: 30,000
* Sliding window: 512
* Intermediate dimension: 7,168
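
The sliding-window entry means that each token attends only to a bounded span of recent tokens rather than the whole prefix. The sketch below illustrates the general idea with a small boolean mask. It assumes the common convention in which a token sees itself plus the preceding `window - 1` positions; the exact convention used by Ganga-1b is not stated in this card, and `sliding_window_causal_mask` is a hypothetical helper, not part of the model's code.

```python
def sliding_window_causal_mask(seq_len, window):
    """Return a seq_len x seq_len boolean matrix.

    mask[i][j] is True when query position i may attend to key position j.
    Assumed convention: causal (j <= i) and at most `window` positions,
    counting the query position itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Tiny illustration; the model itself lists a window of 512.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a window of 512, the attention cost per token stays bounded however long the sequence grows, which fits the model's goal of running on constrained hardware.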
## Evaluation
### Results
<details open>
<summary>Tokenizers Results</summary>
<br>
| Model | Fertility |
|:-----------:|:---------:|
| ***ganga-1b*** | ***1.12*** |
| pragna-1b | 1.58 |
| bloom-1b1 | 1.27 |
| bloom-1b7 | 1.27 |
| gemma-2b | 1.89 |
| bloom-3b | 1.27 |
| airavata-7b | 1.69 |
</details>
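
Fertility is commonly defined as the average number of tokens a tokenizer produces per word, so lower values mean less fragmentation. The card does not spell out the exact procedure used for the table above, so the following is only a minimal sketch that assumes whitespace-delimited words; `fertility` and `toy_tokenize` are hypothetical names introduced for illustration.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word.

    `tokenize` is any callable that maps a string to a list of tokens.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def toy_tokenize(text):
    # Hypothetical stand-in tokenizer: cut every word into chunks of
    # at most 4 characters.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return tokens


sample = ["tokenizers split words", "into pieces"]
score = fertility(sample, toy_tokenize)
print(score)
```

To measure a real tokenizer, pass the `tokenize` method of a loaded `AutoTokenizer` in place of `toy_tokenize` and run it over a held-out Hindi corpus.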
<details open>
<summary>Metrics</summary>
<br>
| Model | PPL<sub>Ours</sub> | PPL<sub>Airawat</sub> |
|:-----------:|:---------:|:------:|
| ganga-1b | | 34.85 |
| pragna-1b | | 12.74 |
| bloom-1b1 | | 33.39 |
| bloom-1b7 | | 26.63 |
| gemma-2b | | 41.67 |
| bloom-3b | | 23.77 |
| airavata-7b | | 46.24 |
</details>
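
Perplexity (PPL) is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows that computation on a list of per-token natural-log probabilities. It assumes those log-probabilities have already been obtained from a model; `perplexity` is a hypothetical helper, not the evaluation script behind the table above.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Sanity check: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among 4 options.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
print(ppl)
```

Because perplexity is computed per token, scores from models with different tokenizers are not directly comparable unless they are normalized to a common unit such as words or characters.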
#### Summary
## Bias, Risks, and Limitations
### Recommendations
<span style="color:red">This model is a research preview under ongoing iterative updates, and as such it provides only limited safety measures. It may also generate offensive content. Using this model for any illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.</span>
## Model Card Contact
[Lingo Research Labs at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) <br>
Mail at: [[email protected]]([email protected])