Arxiv Papers
What matters when building vision-language models? Paper • 2405.02246 • Published May 3