Collections
Collections including paper arxiv:2105.13626
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
  Paper • 2105.13626 • Published • 3
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 88
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
  Paper • 2410.20771 • Published • 3

- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
  Paper • 2402.14083 • Published • 47
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- Training a T5 Using Lab-sized Resources
  Paper • 2208.12097 • Published • 1
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
  Paper • 2212.05055 • Published • 5

- CodeBERT: A Pre-Trained Model for Programming and Natural Languages
  Paper • 2002.08155 • Published • 2
- Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise
  Paper • 2212.11685 • Published • 2
- ReNoise: Real Image Inversion Through Iterative Noising
  Paper • 2403.14602 • Published • 19
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
  Paper • 2105.13626 • Published • 3

- A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese
  Paper • 2304.08999 • Published • 2
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
  Paper • 2309.09400 • Published • 84
- Robust Open-Vocabulary Translation from Visual Text Representations
  Paper • 2104.08211 • Published • 1
- Poro 34B and the Blessing of Multilinguality
  Paper • 2404.01856 • Published • 13

- SELF: Language-Driven Self-Evolution for Large Language Model
  Paper • 2310.00533 • Published • 2
- GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
  Paper • 2310.00576 • Published • 2
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
  Paper • 2305.13169 • Published • 3
- Transformers Can Achieve Length Generalization But Not Robustly
  Paper • 2402.09371 • Published • 13

- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 16
- Transformers Can Achieve Length Generalization But Not Robustly
  Paper • 2402.09371 • Published • 13
- Triple-Encoders: Representations That Fire Together, Wire Together
  Paper • 2402.12332 • Published • 2
- BERTs are Generative In-Context Learners
  Paper • 2406.04823 • Published • 1