Training data composition

#12
opened by namespace-Pt

Thank you for sharing such a nice model! I tried it myself and the performance is good.

What do you mean by "we generate long contexts by augmenting SlimPajama"? Do you use only pure pre-training data, or do you also curate some instruction-tuning data based on SlimPajama?

Gradient AI org

Thanks for your comment. The majority of the token composition is indeed pre-training-style data. Creating better long-context synthetic datasets could be an interesting follow-up work (currently limited by OSS models having too short a context to produce such datasets, and by the fact that most human-written books are <= 100K tokens in length).
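
For readers wondering what "augmenting SlimPajama into long contexts" might look like in practice, here is a minimal sketch of one common approach: packing shorter pre-training documents into fixed-length long sequences. This is an illustration only, not the authors' confirmed pipeline; the dataset split, the tokenizer choice, and the `TARGET_LEN` value are all assumptions.

```python
# Hypothetical sketch of long-context packing (not the confirmed pipeline).
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

TARGET_LEN = 262_144  # assumed target context length in tokens

# Assumed tokenizer; any tokenizer works for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# Stream SlimPajama so nothing has to fit in memory.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def pack_long_contexts(docs, target_len):
    """Concatenate tokenized documents (separated by EOS) until target_len is reached."""
    buffer = []
    for doc in docs:
        buffer.extend(tokenizer(doc["text"])["input_ids"])
        buffer.append(tokenizer.eos_token_id)
        # Emit fixed-length samples as soon as the buffer is long enough.
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]

# Peek at the first two packed samples.
for sample in islice(pack_long_contexts(stream, TARGET_LEN), 2):
    print(len(sample))  # each packed sample is exactly TARGET_LEN tokens
```

Packing like this produces long sequences from ordinary pre-training text, which matches "pre-training-style data" above, as opposed to curated instruction-tuning pairs.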
