metadata
license: apache-2.0
datasets:
- wikimedia/wikipedia
- bookcorpus
- nomic-ai/nomic-bert-2048-pretraining-data
language:
- en
inference: false
nomic-bert-2048: A 2048 Sequence Length Pretrained BERT
nomic-bert-2048 is a BERT model pretrained on Wikipedia and BookCorpus with a maximum sequence length of 2048.
We make several modifications to our BERT training procedure inspired by MosaicBERT. Namely, we:
- Use Rotary Position Embeddings to allow for context length extrapolation.
- Use SwiGLU activations, which have been shown to improve model performance.
- Set dropout to 0.
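To make the first two modifications concrete, here is a minimal pure-Python sketch of a rotary position embedding and a SwiGLU unit. It is illustrative only and is not taken from the nomic-bert-2048 code: the function names, the interleaved pairing of features, the `base=10000.0` frequency constant, and the omission of biases and the output projection are all assumptions made for this example.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to one query/key vector.

    Assumes an even-length vector; consecutive feature pairs are rotated
    by an angle proportional to the token position (hypothetical layout).
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation angle
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu(x, w_gate, w_value):
    """SwiGLU(x) = SiLU(x @ w_gate) * (x @ w_value), elementwise.

    Each weight matrix is given as a list of columns; no bias terms.
    """
    gate = [silu(dot(x, col)) for col in w_gate]
    value = [dot(x, col) for col in w_value]
    return [g * v for g, v in zip(gate, value)]


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset between positions,
# so shifting both positions by the same amount leaves it unchanged.
near = dot(rotate(q, 3), rotate(k, 1))
far = dot(rotate(q, 103), rotate(k, 101))
print(abs(near - far) < 1e-9)
```

Because the score between a rotated query and key depends only on their relative offset, no learned table of absolute positions is needed, which is the property usually cited when rotary embeddings are used for context length extrapolation.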
We evaluate the quality of nomic-bert-2048 on the standard GLUE benchmark. We find it performs comparably to other BERT models but with the advantage of a significantly longer context length.
Model | Batch size | Steps | Seq. length | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NomicBERT | 4k | 100k | 2048 | 0.84 | 0.50 | 0.93 | 0.88 | 0.90 | 0.92 | 0.86 | 0.92 | 0.82 |
RobertaBase | 8k | 500k | 512 | 0.86 | 0.64 | 0.95 | 0.90 | 0.91 | 0.92 | 0.88 | 0.93 | 0.79 |
JinaBERTBase | 4k | 100k | 512 | 0.83 | 0.51 | 0.95 | 0.88 | 0.90 | 0.81 | 0.86 | 0.92 | 0.79 |
MosaicBERT | 4k | 178k | 128 | 0.85 | 0.59 | 0.94 | 0.89 | 0.90 | 0.92 | 0.86 | 0.91 | 0.83 |
Pretraining Data
We use BookCorpus and a 2023 dump of Wikipedia. We tokenize the documents and pack them into sequences of 2048 tokens. If a document is shorter than 2048 tokens, we append another document until the sequence reaches 2048 tokens. If a document is longer than 2048 tokens, we split it across multiple sequences. We release the dataset as nomic-ai/nomic-bert-2048-pretraining-data.
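The packing step can be sketched roughly as follows. This is a simplified illustration, not the actual preprocessing script: the function name `pack_documents`, the `drop_last` flag, and the choice to concatenate documents without inserting any extra separator token are assumptions made for the example.

```python
def pack_documents(tokenized_docs, seq_len=2048, drop_last=True):
    """Greedily pack tokenized documents into fixed-length sequences.

    tokenized_docs: iterable of lists of token ids (already tokenized).
    Short documents are joined with the documents that follow them; long
    documents spill over into the next sequence(s).
    """
    buffer = []   # tokens waiting to be emitted
    packed = []   # completed sequences of exactly seq_len tokens
    for doc in tokenized_docs:
        buffer.extend(doc)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    # Assumption: the trailing partial sequence is dropped unless requested.
    if buffer and not drop_last:
        packed.append(buffer)
    return packed


# Toy example with a sequence length of 4 instead of 2048.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(pack_documents(docs, seq_len=4))
# → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With `drop_last=False`, the same input also yields the leftover `[9, 10]` as a final, shorter sequence; how the real pipeline treats that remainder is not stated here.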
Usage
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') # `nomic-bert-2048` uses the standard BERT tokenizer
config = AutoConfig.from_pretrained('nomic-ai/nomic-bert-2048', trust_remote_code=True) # the config needs to be passed in
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-bert-2048', config=config, trust_remote_code=True)
# To use this model directly for masked language modeling
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")
print(fill_mask("I [MASK] to the store yesterday."))