
jxm/gpt-oss-20b-base · Text Generation · 21B params
I also have a hypothesis that this model can be downsized efficiently not by pruning experts, but by using merges and LoRAs to shrink each expert's unique parameter count. The merged weights would carry most of what the experts share, and the routing table wouldn't need to change.
I'm building a new version of my pipeline to test this hypothesis. I suspect it'd let us keep most of the performance in under 12B parameters.
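For concreteness, here's a minimal sketch of that decomposition in PyTorch. Everything in it is illustrative (the function names, the rank, using a simple mean as the merge), not the actual pipeline: each expert weight is collapsed into one shared merged matrix plus a LoRA-style low-rank delta, recovered via truncated SVD of the residual.

```python
import torch

def merge_experts_to_lora(expert_weights, rank=16):
    """Collapse per-expert weight matrices into one shared ("merged")
    matrix plus a low-rank (LoRA-style) delta per expert.

    expert_weights: list of [out, in] tensors, one per expert.
    Returns (shared, deltas), where each delta is an (A, B) pair such
    that shared + B @ A approximates the original expert weight.
    """
    stacked = torch.stack(expert_weights)           # [E, out, in]
    shared = stacked.mean(dim=0)                    # the "merge"
    deltas = []
    for w in expert_weights:
        # Truncated SVD of the residual: the best rank-r approximation
        # of what makes this expert unique.
        u, s, vh = torch.linalg.svd(w - shared, full_matrices=False)
        B = u[:, :rank] * s[:rank]                  # [out, r]
        A = vh[:rank, :]                            # [r, in]
        deltas.append((A, B))
    return shared, deltas

def expert_forward(x, shared, delta):
    """Forward pass for one routed token; routing itself is unchanged."""
    A, B = delta
    return x @ (shared + B @ A).T
```

The router still picks experts by index as before; only the per-expert weights become `shared + B @ A`. With E experts of shape out×in, the expert parameters drop from E·out·in to out·in + E·r·(out + in), which is one way the total could land under 12B.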
This is a very cool release! I really enjoy the ShiningValiant series!
Do you see potential to prune experts or layers from the gpt-oss-20b model to downsize it, and then fine-tune?
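For what "prune experts" would mean mechanically, here's an illustrative sketch (hypothetical names, not gpt-oss internals): drop a subset of experts and mask the router so it only scores the survivors; `top_k=4` mirrors gpt-oss-20b's four active experts per token.

```python
import torch

def route_with_pruned_experts(router_logits, kept_expert_ids, top_k=4):
    """Mask pruned experts out of the router, then route as usual.

    router_logits: [tokens, num_experts] raw router scores.
    kept_expert_ids: indices of the experts retained after pruning.
    """
    mask = torch.full_like(router_logits, float("-inf"))
    mask[:, kept_expert_ids] = 0.0
    logits = router_logits + mask               # pruned experts can't win
    weights, chosen = torch.topk(logits, k=top_k, dim=-1)
    weights = torch.softmax(weights, dim=-1)    # renormalize over top-k
    return weights, chosen
```

The fine-tuning step afterwards would then mainly be about recovering whatever the dropped experts used to handle.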