stereoplegic's Collections
Efficient Memory Management for Large Language Model Serving with PagedAttention
Paper • 2309.06180 • Published • 25
LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
Paper • 2308.16137 • Published • 39
Scaling Transformer to 1M tokens and beyond with RMT
Paper • 2304.11062 • Published • 2
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
Paper • 2309.14509 • Published • 17
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Paper • 2308.16369 • Published • 1
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Paper • 2309.12307 • Published • 88
PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training
Paper • 2309.10400 • Published • 26
Efficient Streaming Language Models with Attention Sinks
Paper • 2309.17453 • Published • 13
Replacing softmax with ReLU in Vision Transformers
Paper • 2309.08586 • Published • 17
Adapting Language Models to Compress Contexts
Paper • 2305.14788 • Published • 1
In-context Autoencoder for Context Compression in a Large Language Model
Paper • 2307.06945 • Published • 27
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Paper • 2310.12109 • Published • 1
Linformer: Self-Attention with Linear Complexity
Paper • 2006.04768 • Published • 2
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Paper • 2305.13245 • Published • 5
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
Paper • 2309.11568 • Published • 10
The Closeness of In-Context Learning and Weight Shifting for Softmax Regression
Paper • 2304.13276 • Published • 1
S^{3}: Increasing GPU Utilization during Generative Inference for Higher Throughput
Paper • 2306.06000 • Published • 1
Self-slimmed Vision Transformer
Paper • 2111.12624 • Published • 1
Robustifying Token Attention for Vision Transformers
Paper • 2303.11126 • Published • 1
Combiner: Full Attention Transformer with Sparse Computation Cost
Paper • 2107.05768 • Published • 1
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies
Paper • 2302.06218 • Published • 1
Attention Bottlenecks for Multimodal Fusion
Paper • 2107.00135 • Published • 1
Blockwise Self-Attention for Long Document Understanding
Paper • 1911.02972 • Published • 1
LSG Attention: Extrapolation of pretrained Transformers to long sequences
Paper • 2210.15497 • Published • 1
Cure the headache of Transformers via Collinear Constrained Attention
Paper • 2309.08646 • Published • 12
VSA: Learning Varied-Size Window Attention in Vision Transformers
Paper • 2204.08446 • Published • 1
Bird-Eye Transformers for Text Generation Models
Paper • 2210.03985 • Published • 1
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Paper • 2303.13755 • Published • 1
TRAMS: Training-free Memory Selection for Long-range Language Modeling
Paper • 2310.15494 • Published • 1
Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention
Paper • 2310.07911 • Published • 1
Memoria: Hebbian Memory Architecture for Human-Like Sequential Processing
Paper • 2310.03052 • Published • 1
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Paper • 2205.14135 • Published • 11
Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation
Paper • 2309.14174 • Published • 1
Attention Is Not All You Need Anymore
Paper • 2308.07661 • Published • 1
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Paper • 2103.03404 • Published • 1
Semantics-aware Attention Improves Neural Machine Translation
Paper • 2110.06920 • Published • 1
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Paper • 2211.11315 • Published • 1
Attention Is All You Need
Paper • 1706.03762 • Published • 49
Ultra-Long Sequence Distributed Transformer
Paper • 2311.02382 • Published • 2
Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
Paper • 2311.02262 • Published • 10
Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency
Paper • 2311.02772 • Published • 3
MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers
Paper • 2012.15828 • Published • 1
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Paper • 2002.10957 • Published • 1
ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention
Paper • 2304.02147 • Published • 1
GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
Paper • 2311.01927 • Published • 1
Improving Transformers with Probabilistic Attention Keys
Paper • 2110.08678 • Published • 1
Wide Attention Is The Way Forward For Transformers?
Paper • 2210.00640 • Published • 1
A Practical Survey on Faster and Lighter Transformers
Paper • 2103.14636 • Published • 1
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Paper • 2306.12929 • Published • 12
Scaling TransNormer to 175 Billion Parameters
Paper • 2307.14995 • Published • 21
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
Paper • 2306.06446 • Published • 1
Are Sixteen Heads Really Better than One?
Paper • 1905.10650 • Published • 2
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Paper • 2311.05908 • Published • 12
Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems
Paper • 2311.05884 • Published • 5
Exemplar-free Continual Learning of Vision Transformers via Gated Class-Attention and Cascaded Feature Drift Compensation
Paper • 2211.12292 • Published • 1
Latency Adjustable Transformer Encoder for Language Understanding
Paper • 2201.03327 • Published • 1
AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models
Paper • 2010.03688 • Published • 1
Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
Paper • 2305.17328 • Published • 2
Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation
Paper • 2112.05364 • Published • 1
Alleviating the Inequality of Attention Heads for Neural Machine Translation
Paper • 2009.09672 • Published • 1
CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure
Paper • 2210.04633 • Published • 1
Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse
Paper • 2311.07468 • Published • 1
Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models
Paper • 2307.01379 • Published • 1
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles
Paper • 2306.01705 • Published • 1
Relaxed Attention for Transformer Models
Paper • 2209.09735 • Published • 1
System 2 Attention (is something you might need too)
Paper • 2311.11829 • Published • 39
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
Paper • 2311.10642 • Published • 23
Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention
Paper • 2310.11685 • Published • 1
Attention Sorting Combats Recency Bias In Long Context Language Models
Paper • 2310.01427 • Published • 1
Gated recurrent neural networks discover attention
Paper • 2309.01775 • Published • 7
Your Transformer May Not be as Powerful as You Expect
Paper • 2205.13401 • Published • 1
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
Paper • 2209.15176 • Published • 1
Low Rank Factorization for Compact Multi-Head Self-Attention
Paper • 1912.00835 • Published • 1
Linear Self-Attention Approximation via Trainable Feedforward Kernel
Paper • 2211.04076 • Published • 1
Low-Rank Bottleneck in Multi-head Attention Models
Paper • 2002.07028 • Published • 1
EfficientFormer: Vision Transformers at MobileNet Speed
Paper • 2206.01191 • Published • 1
Transformer in Transformer
Paper • 2103.00112 • Published • 1
COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models
Paper • 2305.17235 • Published • 2
CoLT5: Faster Long-Range Transformers with Conditional Computation
Paper • 2303.09752 • Published • 2
Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator
Paper • 2305.15099 • Published • 1
SparQ Attention: Bandwidth-Efficient LLM Inference
Paper • 2312.04985 • Published • 38
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Paper • 2312.07987 • Published • 41
Efficient Monotonic Multihead Attention
Paper • 2312.04515 • Published • 6
SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion
Paper • 2312.07305 • Published • 1
Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Paper • 2312.08618 • Published • 11
Is Model Attention Aligned with Human Attention? An Empirical Study on Large Language Models for Code Generation
Paper • 2306.01220 • Published • 1
Mixture of Attention Heads: Selecting Attention Heads Per Token
Paper • 2210.05144 • Published • 2
LKCA: Large Kernel Convolutional Attention
Paper • 2401.05738 • Published • 1
HyperAttention: Long-context Attention in Near-Linear Time
Paper • 2310.05869 • Published • 2
Rethinking Attention with Performers
Paper • 2009.14794 • Published • 1
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
Paper • 2310.16270 • Published • 1
Softmax-free Linear Transformers
Paper • 2207.03341 • Published • 1
Gated Linear Attention Transformers with Hardware-Efficient Training
Paper • 2312.06635 • Published • 6
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
Paper • 2112.00029 • Published • 1
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Paper • 2402.04248 • Published • 30
A Quantitative Review on Language Model Efficiency Research
Paper • 2306.01768 • Published • 1
Agent Attention: On the Integration of Softmax and Linear Attention
Paper • 2312.08874 • Published • 2
FLatten Transformer: Vision Transformer using Focused Linear Attention
Paper • 2308.00442 • Published • 1
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Paper • 2402.10644 • Published • 79
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Paper • 2402.19427 • Published • 52
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
Linear Transformers are Versatile In-Context Learners
Paper • 2402.14180 • Published • 6
Attention Approximates Sparse Distributed Memory
Paper • 2111.05498 • Published
Multi-Scale Self-Attention for Text Classification
Paper • 1912.00544 • Published
Scattered Mixture-of-Experts Implementation
Paper • 2403.08245 • Published • 1
Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost
Paper • 2312.08614 • Published • 1
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Paper • 2404.07413 • Published • 36
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
Paper • 2405.11582 • Published • 13
Yuan 2.0-M32: Mixture of Experts with Attention Router
Paper • 2405.17976 • Published • 18
LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Paper • 2402.10685 • Published • 1
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Paper • 2406.07522 • Published • 37
SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models
Paper • 2406.05678 • Published
Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention
Paper • 2405.17381 • Published
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
Paper • 2402.04347 • Published • 13
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression
Paper • 2407.12077 • Published • 54
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Paper • 2406.15486 • Published
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Paper • 2407.15891 • Published
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
Paper • 2408.04093 • Published • 4
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Paper • 2409.04431 • Published • 1
Weighted Grouped Query Attention in Transformers
Paper • 2407.10855 • Published
On the Benefits of Rank in Attention Layers
Paper • 2407.16153 • Published
Beyond KV Caching: Shared Attention for Efficient LLMs
Paper • 2407.12866 • Published
Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
Paper • 2408.08454 • Published
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads
Paper • 2407.17678 • Published
Post-Training Sparse Attention with Double Sparsity
Paper • 2408.07092 • Published
Palu: Compressing KV-Cache with Low-Rank Projection
Paper • 2407.21118 • Published • 1
Inference-Friendly Models With MixAttention
Paper • 2409.15012 • Published