Collections including paper arxiv:2412.09871

Each list below is the four-paper preview of one community collection that includes this paper.

- The Impact of Positional Encoding on Length Generalization in Transformers
  Paper • 2305.19466 • Published • 2
- Transformers Can Do Arithmetic with the Right Embeddings
  Paper • 2405.17399 • Published • 52
- Teaching Transformers Causal Reasoning through Axiomatic Training
  Paper • 2407.07612 • Published • 2
- Round and Round We Go! What makes Rotary Positional Encodings useful?
  Paper • 2410.06205 • Published • 1

- The Impact of Positional Encoding on Length Generalization in Transformers
  Paper • 2305.19466 • Published • 2
- Qwen2 Technical Report
  Paper • 2407.10671 • Published • 160
- Round and Round We Go! What makes Rotary Positional Encodings useful?
  Paper • 2410.06205 • Published • 1
- ThunderKittens: Simple, Fast, and Adorable AI Kernels
  Paper • 2410.20399 • Published • 1

- Instruction Tuning with Human Curriculum
  Paper • 2310.09518 • Published • 3
- A Thorough Examination of Decoding Methods in the Era of LLMs
  Paper • 2402.06925 • Published • 1
- Distilling System 2 into System 1
  Paper • 2407.06023 • Published • 3
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 75

- Gecko: Versatile Text Embeddings Distilled from Large Language Models
  Paper • 2403.20327 • Published • 47
- Round and Round We Go! What makes Rotary Positional Encodings useful?
  Paper • 2410.06205 • Published • 1
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 75
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
  Paper • 2410.20771 • Published • 3

- LIMA: Less Is More for Alignment
  Paper • 2305.11206 • Published • 21
- Garment3DGen: 3D Garment Stylization and Texture Generation
  Paper • 2403.18816 • Published • 21
- EgoLifter: Open-world 3D Segmentation for Egocentric Perception
  Paper • 2403.18118 • Published • 10
- The Unreasonable Ineffectiveness of the Deeper Layers
  Paper • 2403.17887 • Published • 78

- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
  Paper • 2403.12943 • Published • 14
- Masked Audio Generation using a Single Non-Autoregressive Transformer
  Paper • 2401.04577 • Published • 42
- Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
  Paper • 2404.02747 • Published • 11
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
  Paper • 2404.02733 • Published • 20

- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 125
- Evolutionary Optimization of Model Merging Recipes
  Paper • 2403.13187 • Published • 50
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 12
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 65