metadata

license: apache-2.0
language:
  - en
library_name: transformers
tags:
  - pubmed
  - arxiv
  - representations
  - scientific documents
  - bert
widget:
  - text: >-
      Tissue-based diagnostics and research is incessantly evolving with the
      development of new molecular tools. It has long been realized that
      immunohistochemistry can add an important new level of information on top
      of morphology and that protein expression patterns in a cancer may yield
      crucial diagnostic and prognostic information. We have generated an
      immunohistochemistry-based map of protein expression profiles in normal
      tissues, cancer and cell lines.
    example_title: Journal prediction

This is the fine-tuned model presented in "MIReAD: simple method for learning high-quality representations from scientific documents" (ACL 2023).

We trained MIReAD on more than 500,000 PubMed and arXiv abstracts spanning over 2,000 journal classes. The model was initialized with SciBERT weights and fine-tuned to predict a paper's journal class from its title and abstract. MIReAD uses SciBERT's tokenizer.

Overall, with MIReAD you can:

- extract a semantically meaningful representation from a paper's abstract
- predict the journal class from a paper's abstract
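
To illustrate the first use case: document-level vectors are commonly compared with cosine similarity, for example to find related papers. The sketch below is a plain-Python illustration under that assumption, not part of the MIReAD code; `cosine_similarity` is a hypothetical helper, and the three-dimensional vectors are toy stand-ins for the 768-dimensional features the model produces.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for document embeddings (not real model output).
paper_a = [0.2, 0.9, 0.1]
paper_b = [0.25, 0.8, 0.05]
paper_c = [0.9, -0.1, 0.4]

print(cosine_similarity(paper_a, paper_b))  # close to 1: nearly the same direction
print(cosine_similarity(paper_a, paper_c))  # noticeably lower
```

With real model output, a feature row can be converted to a list first, for example by calling `.tolist()` on a PyTorch tensor.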

To load the MIReAD model:

from transformers import BertForSequenceClassification, AutoTokenizer

mpath = 'arazd/miread'
model = BertForSequenceClassification.from_pretrained(mpath)
tokenizer = AutoTokenizer.from_pretrained(mpath)

To use MIReAD for feature extraction and classification:

# sample abstract & title text
title = 'MIReAD: simple method for learning scientific representations'
abstr = 'Learning semantically meaningful representations from scientific documents can ...'
text = title + tokenizer.sep_token + abstr
source_len = 512
inputs = tokenizer(text,
                   max_length=source_len,
                   padding="max_length",  # replaces the deprecated pad_to_max_length=True
                   truncation=True,
                   return_tensors="pt")

# classification (getting logits over 2,734 journal classes)
out = model(**inputs)
logits = out.logits 

# feature extraction (getting 768-dimensional feature profiles)
out = model.bert(**inputs)
# IMPORTANT: use the [CLS] token representation as the document-level representation (hence index 0)
feature = out.last_hidden_state[:, 0, :]
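
The `logits` from the classification step above are raw scores, one per journal class. A minimal sketch of turning them into ranked predictions follows. It uses plain Python and toy numbers so it runs without the model; `top_k_classes` is a hypothetical helper. `model.config.id2label` is a standard `transformers` config attribute, but whether this checkpoint populates it with journal names is an assumption, so the helper falls back to class indices.

```python
import math

def top_k_classes(logits, k=3, id2label=None):
    """Return the k most probable (label, probability) pairs from raw logits."""
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    labels = id2label or {}
    # Fall back to the integer class index when no label mapping is given.
    return [(labels.get(i, i), probs[i]) for i in ranked]

# Toy logits over four classes (not real model output).
toy_logits = [0.1, 2.0, -1.0, 0.5]
for label, prob in top_k_classes(toy_logits, k=2):
    print(label, round(prob, 3))
```

With the real model, the call would look like `top_k_classes(logits[0].tolist(), k=5, id2label=model.config.id2label)`, assuming the label mapping is present in the checkpoint's config.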

Our PubMed and arXiv abstracts, together with their journal labels, are available here: https://huggingface.co/datasets/brainchalov/pubmed_arxiv_abstracts_data.

To cite this work:

@inproceedings{razdaibiedina2023miread,
   title={MIReAD: simple method for learning high-quality representations from scientific documents},
   author={Razdaibiedina, Anastasia and Brechalov, Alexander},
   booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
   year={2023}
}