BarcodeBERT for Taxonomic Classification

A pre-trained transformer model for inference on insect DNA barcoding data.


To use BarcodeBERT as a feature extractor:

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bioscan-ml/BarcodeBERT", trust_remote_code=True)

# Load the model
model = AutoModel.from_pretrained("bioscan-ml/BarcodeBERT", trust_remote_code=True)

# Sample sequence
dna_seq = 'ACGCGCTGACGCATCAGCATACGA'

# Tokenize
input_seq = tokenizer(dna_seq, return_tensors='pt')['input_ids']

# Pass through the model
output = model(input_seq)['hidden_states'][-1]

# Global average pooling over the token dimension (dim 1)
features = output.mean(1)
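
The pooling step above averages over every token position. When several sequences of different lengths are padded into one batch, a plain mean would also average the padding positions. The following is a minimal sketch of masked average pooling, not part of the BarcodeBERT code: the helper name `masked_mean_pool` is hypothetical, and small NumPy arrays stand in for the model's last hidden state and for an attention mask (1 for real tokens, 0 for padding). The same arithmetic applies to PyTorch tensors.

```python
import numpy as np


def masked_mean_pool(hidden, mask):
    """Average token embeddings over the sequence axis, ignoring masked positions.

    hidden: float array of shape (batch, seq_len, dim)
    mask:   0/1 array of shape (batch, seq_len); 1 marks a real token
    """
    mask = mask.astype(hidden.dtype)[:, :, None]   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)           # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # (batch, 1); avoids division by zero
    return summed / counts


# Toy stand-in for a last hidden state: 2 sequences, 4 positions, 3 dimensions.
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)

# Assumed mask: the second sequence has two padding positions at the end.
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])

features = masked_mean_pool(hidden, mask)  # shape (2, 3)
plain = hidden.mean(axis=1)                # unmasked mean, for comparison
```

For a sequence with no padding the two results agree; for the padded sequence only the first two positions contribute to `features`. Whether the tokenizer in the snippet above returns an attention mask under the key `attention_mask` is an assumption to verify before wiring this into the pipeline.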

Citation

If you find BarcodeBERT useful in your research, please consider citing:

@misc{arias2023barcodebert,
  title={{BarcodeBERT}: Transformers for Biodiversity Analysis},
  author={Pablo Millan Arias
    and Niousha Sadjadi
    and Monireh Safari
    and ZeMing Gong
    and Austin T. Wang
    and Scott C. Lowe
    and Joakim Bruslund Haurum
    and Iuliia Zarubiieva
    and Dirk Steinke
    and Lila Kari
    and Angel X. Chang
    and Graham W. Taylor
  },
  year={2023},
  eprint={2311.02401},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arxiv.2311.02401},
}

Model size: 29.1M params (Safetensors; tensor types I64 and F32)