Seeking Insights on the Composition of the dots.llm1 Pre-training Data
#10 · opened by AshleyLL
I understand that the dots.llm1 model was pre-trained on 11.2T high-quality natural tokens, with no synthetic data included, and that the dataset covers both English and Chinese.
To better understand the model, I would like to learn more about the composition of the pre-training data. Specifically, what categories of data are included, and in what proportions?
Looking forward to any additional insights or details.
Thank you very much!