How did you manage to get GSM8K up a full 1.9 percentage points over a model trained on 15T tokens?
IMO, from looking at other models, the math scores of FT models usually don't go up by much.
How can you improve GSM8K with a pure FT of a 15T-token model?
Forgive me if my skepticism is unwarranted, but I really don't know how one person (even with 12 years of hard work and experience!) can improve a model whose training data was looked over by at least 40 others, reputable ones no less.
Call me impressed, but also raising an eyebrow, when something like this happens. (Speaking as someone who's worked with experienced people on FT'd deep neural network applications.)
Would love to see every possible test run on this model. In fact, I'd love to see it put through hell. :) (and to see the results myself)
The actual fine-tuning method is closed source, but two fun facts:
- The dataset was used in its entirety: no filtering (not even removing possibly empty/short/long samples), no additions; the whole dataset went in as-is.
- My local llm-eval run gave a much higher GSM8K score than the one on the Leaderboard (a minimal reproduction sketch follows the numbers below):
exact_match,strict-match: 0.9143290371493555
exact_match_stderr,strict-match: 0.007709218855882762
exact_match,flexible-extract: 0.9021986353297953
exact_match_stderr,flexible-extract: 0.00818211982184907
alias: "gsm8k"
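For anyone who wants to run the same check locally: the output above is the format produced by EleutherAI's lm-evaluation-harness, and a minimal sketch of such a run (v0.4+ Python API) looks roughly like the code below. The model ID is just a placeholder, and the few-shot count and batch size are assumptions (5-shot is the usual GSM8K setup), not necessarily the exact settings behind the numbers above.

```python
# Rough sketch of a local GSM8K run with EleutherAI's lm-evaluation-harness
# (v0.4+ Python API). The model ID is a placeholder, not the actual checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=your-org/your-llama-3-70b-finetune,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,        # 5-shot is the common GSM8K setup
    batch_size="auto",
)

# The harness reports both answer-extraction filters seen above
# (strict-match and flexible-extract), each with exact_match and its stderr.
print(results["results"]["gsm8k"])
```

The same thing can also be done from the command line with the `lm_eval` CLI (`--tasks gsm8k`), if you'd rather not write any Python.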
So it is possible to improve this model's GSM8K even further than what the LB shows. One last point: Llama-3-70B is a very sophisticated model! I've never seen such a capable model; I'd give it all the credit!