Hello SmolTuners!
As the description says, our main mission is to create 'small LLMs' that are usable for more specific tasks. To do this, we definitely have to focus on datasets that can add value. I opened this discussion to gather and note some datasets worth looking at, because the dataset has to be the starting point for fine-tuning (aside from quantization and model merging). Have a great day <3
https://huggingface.co/datasets/HuggingFaceTB/smoltalk
It was used to finetune SmolLM2, so it could be worth a look. I'd probably filter this for math though.
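For anyone who wants to try the math filtering, here is a rough sketch. It assumes the "all" config of smoltalk exposes a `source` column naming the subset each row came from, and that `metamathqa-50k` and `numina-cot-100k` are the math-heavy subsets; both are assumptions worth checking against the dataset card. The helper names are made up for illustration. The `keep_math` flag covers both readings: keep only math, or drop it.

```python
# Subset names assumed to be math-heavy; verify against the dataset card.
MATH_SOURCES = {"metamathqa-50k", "numina-cot-100k"}


def is_math_source(source):
    """True if a `source` value names one of the assumed math subsets."""
    return source in MATH_SOURCES


def filter_math(rows, keep_math=True):
    """Keep only math rows (keep_math=True) or drop them (keep_math=False)."""
    return [r for r in rows if is_math_source(r["source"]) == keep_math]


def load_smoltalk_filtered(keep_math=True):
    """Download smoltalk and apply the same filter (needs `pip install datasets`)."""
    from datasets import load_dataset  # lazy import, only needed for the download

    # Assumption: the "all" config carries a `source` column per row.
    ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
    return ds.filter(lambda r: is_math_source(r["source"]) == keep_math)
```

`filter_math` works on any list of dicts, so the subset names can be sanity-checked on a few rows before calling `load_smoltalk_filtered()` on the full dataset.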
One thing I've noticed from using a lot of smaller models is that, more often than not, a fresh pretrain of a small model is not the way to go.
Instead, it's better to finetune on top of distilled models such as nvidia/Llama-3.1-Minitron-4B-Width-Base or google/gemma-2-2b-it
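A minimal sketch of starting from one of those checkpoints instead of pretraining, assuming the `transformers` library is installed. `load_for_finetuning` is a hypothetical helper, not part of any library, and the actual training setup (trainer, LoRA, and so on) is left out. Some checkpoints may require accepting a license and logging in to the Hub first.

```python
# Distilled checkpoints named in this thread.
DISTILLED_CHECKPOINTS = [
    "nvidia/Llama-3.1-Minitron-4B-Width-Base",
    "google/gemma-2-2b-it",
]


def load_for_finetuning(model_id):
    """Return (tokenizer, model) for one of the checkpoints listed above."""
    if model_id not in DISTILLED_CHECKPOINTS:
        raise ValueError(f"unexpected checkpoint: {model_id}")

    # Lazy import so the list above can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_for_finetuning(DISTILLED_CHECKPOINTS[0])` downloads the weights, so expect several GB of traffic on the first run.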
I'll have some writeups on the way. I'll post an update this evening, let's go!