Training data composition
#12 · opened by namespace-Pt
Thank you for sharing such a nice model! I tried it myself and the performance is good.
What does "we generate long contexts by augmenting SlimPajama" mean? Do you use only pure pre-training data, or do you also curate instruction-tuning data based on SlimPajama?
Thanks for your comment. The majority of the token composition is indeed pre-training-style data. Creating better long-context synthetic datasets could be interesting follow-up work (currently limited by the fact that OSS models have too short a context to produce such datasets, and that most human-written books are <= 100K tokens long).
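For anyone curious what "augmenting SlimPajama into long contexts" can look like in practice, here is a minimal sketch, not necessarily the authors' exact pipeline: it packs consecutive SlimPajama documents into fixed-length token sequences. The dataset name, the placeholder tokenizer, and the 128K target length are assumptions for illustration.

```python
# Sketch only: build long-context pre-training samples by packing
# consecutive SlimPajama documents until a target token length is reached.
# Dataset/tokenizer names and TARGET_LEN are assumptions, not the authors' setup.
from datasets import load_dataset
from transformers import AutoTokenizer

TARGET_LEN = 128_000  # desired context length in tokens (illustrative)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # placeholder tokenizer
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)


def long_context_samples(doc_stream, target_len=TARGET_LEN):
    """Yield token sequences of `target_len` tokens, built by concatenating
    consecutive documents separated by EOS."""
    buffer = []
    for doc in doc_stream:
        ids = tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
        buffer.extend(ids + [tokenizer.eos_token_id])
        while len(buffer) >= target_len:
            yield buffer[:target_len]      # emit one long-context sample
            buffer = buffer[target_len:]   # keep the remainder for the next sample


# Example: pull a single ~128K-token packed sample
sample = next(long_context_samples(stream))
print(len(sample))
```

Packing plain pre-training text this way gives long sequences without any instruction-tuning signal, which matches the reply above that the token mix is mostly pre-training-style data.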