Update README.md
README.md
CHANGED
@@ -5,9 +5,8 @@ license: other
 
 Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using [this notebook](https://colab.research.google.com/drive/1IZ-KJgzRAMr4Rm_-OWvWwnfTQwRxOknp?usp=sharing) on Google colab!
 
-This is an early checkpoint of `sarvam-2b`, a small, yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
-
-sarvam-2b will be trained on a data mixture of 4 trillion tokens: containing equal parts English (2T) and Indic (2T) tokens.
+This is an early checkpoint of `sarvam-2b`, a small, yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
+`sarvam-2b` will eventually be trained on a data mixture of 4 trillion tokens: containing equal parts English (2T) and Indic (2T) tokens.
 
 The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in [this video](https://www.youtube.com/watch?v=DFtAS1BCKvk).
 