---
license: apache-2.0
language:
- hi
- en
metrics:
- perplexity
widget:
- text: >-
    BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज
  example_title: Example 1
- text: >-
    7 अक्टूबर को हमास से जंग शुरू होने के सात महीने बाद इजरायली सेना
  example_title: Example 2
- text: >-
    हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को
  example_title: Example 3
- text: >-
    पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,
  example_title: Example 4
- text: >-
    13 समन्वित बम विस्फोटों के बाद से मुंबई में कई गैर-राज्य हमले
  example_title: Example 5
- text: >-
    निकोला टेस्ला का जन्म 10 जुलाई 1856 को स्मिलजान, क्रोएशिया में हुआ था,
  example_title: Example 6
- text: >-
    2007 टूर्नामेंट में क्रिकेट विश्व कप के लिए टिकटों से सबसे ज्यादा आमदनी हुई
  example_title: Example 7
---

# Model Card for Ganga-1b! 🌊

The base model **``Ganga-1b``** was trained on a monolingual **Hindi** language dataset as part of ***Project Unity***. We chose the name *Ganga* 🌊 to honor the longest river flowing through the Hindi-speaking region of India 🇮🇳.

**(The first pre-trained Hindi model by any academic research lab in India 🇮🇳!)**


![image/png](https://cdn-uploads.huggingface.co/production/uploads/667b8f8ba271fc5a8e6929de/jG3tZnGPvH6vcGrvxO-YC.png)



### Model Description 📚

**Project Unity** is an initiative to address **India's linguistic diversity** and richness by creating a comprehensive resource covering the country's major languages. We strive for state-of-the-art performance in understanding and generating text in **Indian languages**.
To that end, we train models on monolingual data for India's regional languages. Our first release is the *Ganga-1b* model, *which was trained on a large dataset of public-domain, web-crawled Hindi text, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality)*. The dataset was further curated by native speakers to ensure high quality.
Notably, the **Ganga-1b** model outperforms existing open-source models that support **Indian languages**, including models with up to **7 billion parameters**.



- **Developed by:** [Lingo Research Group at IIT Gandhinagar](https://labs.iitgn.ac.in/lingo/) 
- **Model type:** Autoregressive Language Model
- **Language(s) (NLP):** Bilingual (Primary: *Hindi* [**hi**], Secondary: *English* [**en**])
- **License:** Apache 2.0



## How to Get Started with the Model 👨🏻‍💻

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-1b",
                                             device_map="auto")

input_text = "BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज "
# Move inputs to wherever device_map placed the model (GPU or CPU).
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100,
                         do_sample=True, top_k=50,
                         top_p=0.95, temperature=0.7)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Technical Specifications 🤖

- **Precision**: *Float32*
- **Context Length**: *2,048*
- **Learning Rate**: *4e-4*
- **Optimizer**: *AdamW*
- **LR Scheduler**: *Cosine*
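
The cosine schedule above can be sketched as follows; the peak LR matches the card's *4e-4*, while the warmup and minimum-LR parameters are illustrative assumptions the card does not specify:

```python
import math

def cosine_lr(step, total_steps, peak_lr=4e-4, warmup_steps=0, min_lr=0.0):
    """Cosine-decay LR schedule with optional linear warmup.

    peak_lr follows the card (4e-4); warmup_steps and min_lr are
    illustrative assumptions, not values stated in the card.
    """
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 10_000))       # peak: 0.0004
print(cosine_lr(10_000, 10_000))  # decayed to ~0.0
```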

### Model Architecture and Objective


Ganga-1b is a decoder-only transformer model, featuring the following specifications:


* Layers: 16
* Attention heads: 32
* Embedding dimension: 2,048
* Vocabulary size: 30,000
* Sliding window: 512
* Intermediate dimension: 7,168
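
As a rough sanity check, these dimensions are consistent with ~1B parameters, assuming Llama-style blocks (four attention projections, a gated three-projection MLP, tied embeddings) — an assumption, since the card does not spell out the block structure:

```python
# Back-of-the-envelope parameter count from the card's listed specs.
# Assumes Llama-style blocks with tied embeddings (not confirmed by
# the card), and ignores small norm/bias terms.
layers, d_model, vocab, d_ff = 16, 2048, 30_000, 7_168

embed = vocab * d_model                # token embeddings (tied output head)
attn  = 4 * d_model * d_model          # Q, K, V, O projections per layer
mlp   = 3 * d_model * d_ff             # gate, up, down projections per layer
total = embed + layers * (attn + mlp)

print(f"{total / 1e9:.2f}B parameters")  # ~1.03B
```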


## Evaluation
[More Information Needed]

### Results 🏆

<details open>
<summary>Tokenizers Results</summary>
<br>
  
|    Model    | Fertility |
|:-----------:|:---------:|
|   ***Ganga-1b***  |    ***1.12***   |
|  Pragna-1b  |    1.58   |
|  Bloom-1b1  |    1.27   |
|  Bloom-1b7  |    1.27   |
|   Gemma-2b  |    1.89   |
|   Bloom-3b  |    1.27   |
| Airavata-7b |    1.69   |
| Sarvam-2b   |    1.38   |

</details>
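
Fertility here is the average number of tokens a tokenizer emits per whitespace-separated word — lower is better. A minimal sketch of the measurement, using a toy fixed-width character chunker as a stand-in for a real subword tokenizer:

```python
def fertility(texts, tokenize):
    """Average tokens per whitespace-separated word over a corpus."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy stand-in: splits each word into chunks of up to 3 characters.
# A real measurement would pass e.g. the ganga-1b tokenizer's encode().
def toy_tokenize(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

# "namaste" -> 3 chunks, "duniya" -> 2 chunks: 5 tokens / 2 words = 2.5
print(fertility(["namaste duniya"], toy_tokenize))  # 2.5
```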


<details open>
<summary>Metrics</summary>
<br>
  
|    Model    | PPL<sub>Our Dataset</sub> |   PPL<sub>Sangraha Dataset</sub>  |
|:-----------:|:---------:|:------:|
|   ***Ganga-1b***  |  ***17.92***     |  ***15.82*** |
|  Pragna-1b  |  98.16     |  9.37  |
|  Bloom-1b1  |  27.81     |  17.49 |
|  Bloom-1b7  |  22.49     |  14.28 |
|   Gemma-2b  |  49.27     |  31.01 |
|   Bloom-3b  |  19.99     |  12.82 |
| OpenHathi-7B | 42.95      | 25.73 |
| Airavata-7b |  60.87     |  38.24 |
| Sarvam-2b   |  18.56     |  10.31 |

</details>


## Summary



## Bias, Risks, and Limitations 🚨


### Recommendations ‼️

<span style="color:red">This model is a research preview under ongoing iterative updates and, as such, provides only limited safety measures. It may generate offensive content. Use of the model for any illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.</span>

## More Information

**DEMO:** [https://huggingface.co/spaces/Lingo-IITGN/ganga-1b](https://huggingface.co/spaces/Lingo-IITGN/ganga-1b)

## Model Card Contact ✉️

[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) <br/>
Mail at: [[email protected]]([email protected])