Training data composition
#12 · opened by namespace-Pt
Thank you for sharing such a nice model! I tried it myself and the performance is good.
What does "we generate long contexts by augmenting SlimPajama" mean? Do you use only pure pre-training data, or do you also curate instruction-tuning data based on SlimPajama?
Thanks for your comment. The majority of the token composition is indeed pre-training-style data. Creating better long-context synthetic datasets could be interesting follow-up work (currently limited by the fact that OSS models have too short a context to produce such datasets, and that most human-written books are <= 100K tokens long).
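For anyone curious what "augmenting SlimPajama into long contexts" can look like in practice, here is a minimal sketch, not necessarily the authors' exact pipeline: it packs consecutive SlimPajama documents into fixed-length token sequences. The dataset name, the placeholder tokenizer, and the 128K target length are assumptions for illustration.

```python
# Sketch only: build long-context pre-training samples by packing
# consecutive SlimPajama documents until a target token length is reached.
# Dataset/tokenizer names and TARGET_LEN are assumptions, not the authors' setup.
from datasets import load_dataset
from transformers import AutoTokenizer

TARGET_LEN = 128_000  # desired context length in tokens (illustrative)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # placeholder tokenizer
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)


def long_context_samples(doc_stream, target_len=TARGET_LEN):
    """Yield token sequences of `target_len` tokens, built by concatenating
    consecutive documents separated by EOS."""
    buffer = []
    for doc in doc_stream:
        ids = tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
        buffer.extend(ids + [tokenizer.eos_token_id])
        while len(buffer) >= target_len:
            yield buffer[:target_len]      # emit one long-context sample
            buffer = buffer[target_len:]   # keep the remainder for the next sample


# Example: pull a single ~128K-token packed sample
sample = next(long_context_samples(stream))
print(len(sample))
```

Packing plain pre-training text this way gives long sequences without any instruction-tuning signal, which matches the reply above that the token mix is mostly pre-training-style data.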