arxiv:2412.13663

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Published on Dec 18 · Submitted by jph00 on Dec 19

#1 Paper of the day

Authors:

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli

Abstract

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

Community

jph00

Paper author Paper submitter 5 days ago

We're very excited about the release of ModernBERT -- it feels like it could be the basis of all kinds of interesting new startups and research projects.

In fact, what is mentioned in the paper and blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.

stefan-it

5 days ago

"We remove the Next-Sentence Prediction objective which introduces noticeable overhead for no performance improvement"

But this is only half of the truth and mainly copied from the RoBERTa paper.

The other half: the ALBERT paper (see Table 5) shows an improvement from NSP over no sentence-level objective, not on the SQuAD datasets, but on average. Additionally, their approach of introducing a sentence order prediction (SOP) loss boosts performance on various downstream tasks.
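For readers less familiar with the two objectives being compared, here is a minimal sketch of how training pairs are commonly built for each. The function names, the label convention (1 for the positive pair), and the 50/50 split are illustrative assumptions, not code from the BERT, RoBERTa, ALBERT, or ModernBERT papers.

```python
import random


def make_nsp_example(doc, other_doc, i, rng):
    """Next-Sentence Prediction pair (hypothetical helper).

    Positive: segment i followed by the real next segment of the same document.
    Negative: segment i followed by a random segment from a different document.
    """
    first = doc[i]
    if rng.random() < 0.5:
        return first, doc[i + 1], 1
    return first, rng.choice(other_doc), 0


def make_sop_example(doc, i, rng):
    """Sentence Order Prediction pair (hypothetical helper).

    Both segments always come from the same document.
    Positive: original order. Negative: the same two segments swapped.
    """
    a, b = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return a, b, 1
    return b, a, 0


rng = random.Random(0)
doc = ["s0", "s1", "s2"]
other_doc = ["x0", "x1"]
print(make_nsp_example(doc, other_doc, 0, rng))
print(make_sop_example(doc, 1, rng))
```

The difference this is meant to show: an NSP negative can often be spotted from a topic mismatch alone, whereas an SOP negative shares its topic with the positive and differs only in order.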

stefan-it

5 days ago

I would be interested in how much hardware was used to pretrain the base and large models, and how long pretraining took :)

NohTow

Paper author 5 days ago

Hello,

Everything is included in Table 3 of the paper (Appendix A).

Hope it helps!

shahjaidev

5 days ago

Great work, especially for most industry tasks

TomSchelsen

4 days ago

Thanks for this very welcome modernisation of the 'good old' BERT architecture ;)
However, a big part of the appeal of recent LLM/decoder-only models for a lot of us is their multilingual capability. I would love to see a variant pretrained on more natural languages (instead of code, to keep the same training budget; the two variants would be complementary, i.e. used for different downstream applications). :)