rahular commited on
Commit
1699d55
·
verified ·
1 Parent(s): be3bdb1

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +2 -3
README.md CHANGED
@@ -5,9 +5,8 @@ license: other
5
 
6
  Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using [this notebook](https://colab.research.google.com/drive/1IZ-KJgzRAMr4Rm_-OWvWwnfTQwRxOknp?usp=sharing) on Google colab!
7
 
8
- This is an early checkpoint of `sarvam-2b`, a small, yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
9
-
10
- sarvam-2b will be trained on a data mixture of 4 trillion tokens: containing equal parts English (2T) and Indic (2T) tokens.
11
 
12
  The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in [this video](https://www.youtube.com/watch?v=DFtAS1BCKvk).
13
 
 
5
 
6
  Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using [this notebook](https://colab.research.google.com/drive/1IZ-KJgzRAMr4Rm_-OWvWwnfTQwRxOknp?usp=sharing) on Google colab!
7
 
8
+ This is an early checkpoint of `sarvam-2b`, a small, yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
9
+ `sarvam-2b` will eventually be trained on a data mixture of 4 trillion tokens: containing equal parts English (2T) and Indic (2T) tokens.
 
10
 
11
  The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in [this video](https://www.youtube.com/watch?v=DFtAS1BCKvk).
12