
bio-lm

Model description

This model is a RoBERTa base pre-trained model that was further trained with a masked language modeling task on the BioLang dataset, a compendium of English scientific text from the life sciences.

Intended uses & limitations

How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular.

To have a quick check of the model as-is in a fill-mask task:

from transformers import pipeline, RobertaTokenizerFast

# The model must be used with the roberta-base tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
text = "Let us try this model to see if it <mask>."
fill_mask = pipeline(
    "fill-mask",
    model='EMBO/bio-lm',
    tokenizer=tokenizer
)
# Returns candidate tokens for <mask>, each with a score.
fill_mask(text)
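
The fill-mask pipeline returns a list of dictionaries, one per candidate token, with the keys `score`, `token`, `token_str` and `sequence`. A minimal sketch of how that output could be summarized follows; the helper name `top_predictions` and the sample entries are illustrative only and are not real predictions from this model.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 3)) for r in ranked[:k]]


# Made-up entries in the shape the fill-mask pipeline returns;
# these are NOT real predictions from EMBO/bio-lm, and the token ids are placeholders.
example_results = [
    {"score": 0.12, "token": 0, "token_str": " works", "sequence": "..."},
    {"score": 0.40, "token": 1, "token_str": " helps", "sequence": "..."},
    {"score": 0.05, "token": 2, "token_str": " fails", "sequence": "..."},
]

print(top_predictions(example_results, k=2))
```

With real output, `top_predictions(fill_mask(text))` would be called the same way.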

Limitations and bias

This model should be fine-tuned on a specific task such as token classification. The model must be used with the roberta-base tokenizer.
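
As a starting point for fine-tuning, the sketch below loads the model with a fresh token classification head and pairs it with the roberta-base tokenizer. The label set (`O`, `B-GENE`, `I-GENE`) and the function name `load_for_token_classification` are hypothetical placeholders, not part of this model or its training code; the classification head is randomly initialized and needs training before use.

```python
# Hypothetical label set for illustration; replace with the labels of your task.
LABELS = ["O", "B-GENE", "I-GENE"]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}


def load_for_token_classification():
    """Load bio-lm with a new, untrained token classification head."""
    # Imported lazily so the label mappings above can be inspected
    # without transformers installed.
    from transformers import AutoModelForTokenClassification, RobertaTokenizerFast

    # The model must be used with the roberta-base tokenizer.
    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    model = AutoModelForTokenClassification.from_pretrained(
        "EMBO/bio-lm",
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```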

Training data

The model was trained with a masked language modeling task on the BioLang dataset, which includes 12 million examples from abstracts and figure legends extracted from papers published in the life sciences.

Training procedure

The training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

  • Command: python -m lm.train /data/json/oapmc_abstracts_figs/ MLM
  • Tokenizer vocab size: 50265
  • Training data: EMBO/biolang MLM
  • Training with: 12005390 examples
  • Evaluating on: 36713 examples
  • Epochs: 3.0
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • tensorboard run: lm-MLM-2021-01-27T15-17-43.113766
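
To put the logged values in context, the sketch below collects them in a dictionary whose keys match `transformers.TrainingArguments` parameter names and estimates the number of optimizer steps. It assumes that all 4 GPUs were used in data-parallel mode without gradient accumulation, which the log does not state; the actual setup in the training repository may differ. The function `build_training_arguments` is a hypothetical helper, not taken from the training code.

```python
import math

# Values copied from the training log above.
NUM_TRAIN_EXAMPLES = 12005390
NUM_EPOCHS = 3.0
HYPERPARAMS = {
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "learning_rate": 5e-05,
    "weight_decay": 0.0,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "max_grad_norm": 1.0,
}

# Assumption: 4 GPUs, data-parallel, no gradient accumulation.
NUM_GPUS = 4

effective_batch_size = HYPERPARAMS["per_device_train_batch_size"] * NUM_GPUS
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / effective_batch_size)
total_steps = int(steps_per_epoch * NUM_EPOCHS)


def build_training_arguments(output_dir):
    """Hypothetical helper: map the logged values onto TrainingArguments."""
    from transformers import TrainingArguments  # lazy import

    return TrainingArguments(
        output_dir=output_dir, num_train_epochs=NUM_EPOCHS, **HYPERPARAMS
    )


print(effective_batch_size, steps_per_epoch, total_steps)
```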

End of training:

trainset: 'loss': 0.8653350830078125
validation set: 'eval_loss': 0.8192330598831177, 'eval_recall': 0.8154601116513597
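
For readers who prefer perplexity to raw loss, the short sketch below converts the reported values. It assumes the losses are mean cross-entropy in nats over the masked tokens, which is the usual convention for masked language modeling in transformers but is not stated explicitly in the log.

```python
import math

# Losses copied from the end-of-training log above.
train_loss = 0.8653350830078125
eval_loss = 0.8192330598831177


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


print(f"train: {perplexity(train_loss):.2f}, validation: {perplexity(eval_loss):.2f}")
```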

Eval results

Eval on test set:

recall: 0.814471959728645