stereoplegic's Collections
Super-NaturalInstructions: Generalization via Declarative Instructions
on 1600+ NLP Tasks
Paper
•
2204.07705
•
Published
•
1
Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for
Knowledge-intensive Question Answering
Paper
•
2308.13259
•
Published
•
2
MAmmoTH: Building Math Generalist Models through Hybrid Instruction
Tuning
Paper
•
2309.05653
•
Published
•
10
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language
Models
Paper
•
2309.12284
•
Published
•
18
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages
Paper
•
2309.09400
•
Published
•
84
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons
Images
Paper
•
2310.16825
•
Published
•
32
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Paper
•
2303.03915
•
Published
•
6
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Paper
•
2309.04662
•
Published
•
22
SlimPajama-DC: Understanding Data Combinations for LLM Training
Paper
•
2309.10818
•
Published
•
10
Towards Effective Disambiguation for Machine Translation with Large
Language Models
Paper
•
2309.11668
•
Published
•
1
Improving Translation Faithfulness of Large Language Models via
Augmenting Instructions
Paper
•
2308.12674
•
Published
•
1
TinyStories: How Small Can Language Models Be and Still Speak Coherent
English?
Paper
•
2305.07759
•
Published
•
33
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
with Web Data, and Web Data Only
Paper
•
2306.01116
•
Published
•
32
KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large
Language Model Application
Paper
•
2305.17701
•
Published
•
1
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual
Instruction Tuning
Paper
•
2306.04387
•
Published
•
8
The Flan Collection: Designing Data and Methods for Effective
Instruction Tuning
Paper
•
2301.13688
•
Published
•
8
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark
for Finance
Paper
•
2306.05443
•
Published
•
3
NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages
Paper
•
2309.10661
•
Published
•
1
The Vault: A Comprehensive Multilingual Dataset for Advancing Code
Understanding and Generation
Paper
•
2305.06156
•
Published
•
2
Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval
Model for Searching by Code Snippets
Paper
•
2305.11625
•
Published
•
1
MVP: Multi-task Supervised Pre-training for Natural Language Generation
Paper
•
2206.12131
•
Published
•
1
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing
& Attribution in AI
Paper
•
2310.16787
•
Published
•
5
Language Models can be Logical Solvers
Paper
•
2311.06158
•
Published
•
18
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Paper
•
2308.13387
•
Published
•
1
Skill-it! A Data-Driven Skills Framework for Understanding and Training
Language Models
Paper
•
2307.14430
•
Published
•
3
FACT: Learning Governing Abstractions Behind Integer Sequences
Paper
•
2209.09543
•
Published
•
2
HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
Paper
•
2311.09528
•
Published
•
2
Source Prompt: Coordinated Pre-training of Language Models on Diverse
Corpora from Multiple Sources
Paper
•
2311.09732
•
Published
•
1
Sabiá: Portuguese Large Language Models
Paper
•
2304.07880
•
Published
•
4
Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale
Pretraining Corpus for Math
Paper
•
2312.17120
•
Published
•
25
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training
Paper
•
2401.00849
•
Published
•
15
Trusted Source Alignment in Large Language Models
Paper
•
2311.06697
•
Published
•
10
ChatQA: Building GPT-4 Level Conversational QA Models
Paper
•
2401.10225
•
Published
•
34
Scientific and Creative Analogies in Pretrained Language Models
Paper
•
2211.15268
•
Published
•
1
Dolma: an Open Corpus of Three Trillion Tokens for Language Model
Pretraining Research
Paper
•
2402.00159
•
Published
•
61
Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning
Paper
•
2402.06619
•
Published
•
54
StarCoder 2 and The Stack v2: The Next Generation
Paper
•
2402.19173
•
Published
•
136
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
•
2402.13232
•
Published
•
14
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Paper
•
2309.11346
•
Published
WildChat: 1M ChatGPT Interaction Logs in the Wild
Paper
•
2405.01470
•
Published
•
61
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper
•
2402.10176
•
Published
•
36
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with
Millions of Real Click Labels
Paper
•
2405.07526
•
Published
•
18
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
•
2406.11271
•
Published
•
20
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context
Reinforcement Learning
Paper
•
2406.08973
•
Published
•
86
DataComp-LM: In search of the next generation of training sets for
language models
Paper
•
2406.11794
•
Published
•
50
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal
Documents
Paper
•
2406.13923
•
Published
•
21
Glot500: Scaling Multilingual Corpora and Language Models to 500
Languages
Paper
•
2305.12182
•
Published
•
1