Seeking Insights on the Composition of the dots.llm1 Pre-training Data

#10
by AshleyLL

I understand that the dots.llm1 model was pre-trained on a dataset of 11.2T high-quality natural tokens, with no synthetic data included. The dataset covers both English and Chinese.

To better understand this model, I would like to learn more about the composition of the pre-training dataset. Specifically, what categories of data does it include, and in what proportions?

Looking forward to additional insights or details.

Thank you very much!
