Shifting Long-Context LLMs Research from Input to Output • arXiv:2503.04723 • Published 13 days ago • 19 upvotes
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning • arXiv:2503.07572 • Published 9 days ago • 36 upvotes
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL • arXiv:2503.07536 • Published 9 days ago • 78 upvotes
Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts • arXiv:2503.05447 • Published 12 days ago • 7 upvotes
Forgetting Transformer: Softmax Attention with a Forget Gate • arXiv:2503.02130 • Published 15 days ago • 27 upvotes
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization • arXiv:2503.04598 • Published 13 days ago • 17 upvotes
PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference • arXiv:2502.13502 • Published 28 days ago • 3 upvotes
Liger: Linearizing Large Language Models to Gated Recurrent Structures • arXiv:2503.01496 • Published 16 days ago • 15 upvotes
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs • arXiv:2503.01307 • Published 16 days ago • 31 upvotes
Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models • arXiv:2502.15499 • Published 26 days ago • 13 upvotes
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam • arXiv:2502.17055 • Published 23 days ago • 16 upvotes
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO • arXiv:2502.14669 • Published 27 days ago • 12 upvotes
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? • arXiv:2502.12215 • Published about 1 month ago • 16 upvotes
You Do Not Fully Utilize Transformer's Representation Capacity • arXiv:2502.09245 • Published Feb 13, 2025 • 34 upvotes
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity • arXiv:2502.13063 • Published 29 days ago • 67 upvotes