Experimenting primarily with 7B-12B parameter text completion models. Not all models are intended for direct use; some are meant for educational and/or merge purposes.
I've arrived at an interesting result on the current Open LLM Leaderboard (open-llm-leaderboard/open_llm_leaderboard). After I narrowed the model filter to 8-9B parameters, my recent merge of o1 reasoning models, grimjim/HuatuoSkywork-o1-Llama-3.1-8B, achieved the highest MATH eval result of any Llama 3.x 8B model currently on the board, hitting 33.99% and placing 973/2795.
Unfortunately, I need more information to evaluate the parent models used in the merge. The Skywork/Skywork-o1-Open-Llama-3.1-8B model scored 0% on the MATH eval, which I suspect was due to output formatting that was baked too hard into the model, and placed 2168/2795; by comparison, the merge achieved a significant uplift on every benchmark. FreedomIntelligence/HuatuoGPT-o1-8B had not been benchmarked as of this post, so I am unable to assess its relative results. Nevertheless, it is intriguing that an ostensibly medical o1 model appears to have resulted in a sizable MATH boost.
I'd noticed something was off when merges of Gemma2 9B models ended up having ~10B parameters. The current mergekit package is fine, but there are still bloated models on HF that could stand to be fixed.
The script assumes that it will be run from the same directory as the model weights. It trims the unnecessary lm_head.weight tensor and the corresponding index entry.
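For readers who want to see the shape of such a trim, here is a minimal sketch using only the Python standard library. It is not the script referred to above. It assumes a sharded safetensors checkpoint with the usual model.safetensors.index.json file (a "weight_map" plus an optional "metadata.total_size"), and it reads a whole shard into memory, so it needs at least one shard's worth of RAM. The function names drop_tensor_from_shard and trim_model are hypothetical. My guess at why the tensor is redundant is that Gemma 2 ties its output head to the input embeddings, but treat that as an assumption too. Back up the weights before trying anything like this.

```python
import json
import struct
from pathlib import Path

TENSOR = "lm_head.weight"  # tensor named in the post


def drop_tensor_from_shard(shard_path, name=TENSOR):
    """Rewrite one .safetensors shard without `name`; return bytes removed."""
    raw = Path(shard_path).read_bytes()
    # File layout: 8-byte little-endian header length, JSON header, tensor data.
    (header_len,) = struct.unpack("<Q", raw[:8])
    header = json.loads(raw[8:8 + header_len])
    if name not in header:
        return 0
    data = raw[8 + header_len:]
    metadata = header.pop("__metadata__", None)
    begin, end = header.pop(name)["data_offsets"]
    removed = end - begin

    # Repack the remaining tensors contiguously, keeping their original order.
    new_header, chunks, offset = {}, [], 0
    for key, info in sorted(header.items(), key=lambda kv: kv[1]["data_offsets"][0]):
        b, e = info["data_offsets"]
        chunks.append(data[b:e])
        new_header[key] = {**info, "data_offsets": [offset, offset + (e - b)]}
        offset += e - b
    if metadata is not None:
        new_header["__metadata__"] = metadata

    header_bytes = json.dumps(new_header, separators=(",", ":")).encode("utf-8")
    header_bytes += b" " * (-len(header_bytes) % 8)  # pad header to a multiple of 8 bytes
    Path(shard_path).write_bytes(
        struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(chunks)
    )
    return removed


def trim_model(model_dir=".", name=TENSOR):
    """Drop `name` from its shard and from the index; return bytes removed."""
    model_dir = Path(model_dir)
    index_path = model_dir / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    shard = index["weight_map"].pop(name, None)
    if shard is None:
        return 0  # nothing to trim
    removed = drop_tensor_from_shard(model_dir / shard, name)
    meta = index.get("metadata", {})
    if "total_size" in meta:
        meta["total_size"] -= removed
    index_path.write_text(json.dumps(index, indent=2))
    return removed
```

Calling trim_model(".") from the directory holding the weights mirrors the run-in-place assumption above; the return value is the number of bytes dropped, or 0 if the index has no lm_head.weight entry.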