---
library_name: transformers
tags:
- DNA
- genomics
datasets:
- omicseye/prok_heavy
---

## Introduction

The seqLens models are a collection of genomic language models.
They draw on two datasets: an extensive collection of 19,551 reference genomes,
including over 18,000 prokaryotic genomes (115B nucleotides),
and a more balanced collection of 1,354 genomes spanning 1,166 prokaryotic and 188 eukaryotic reference genomes (180B nucleotides).
Through systematic evaluation of 52 DNA language models with varying architectures, hyperparameters, and classification heads,
we developed seqLens, a family of models based on disentangled attention with relative positional encoding.
These models outperform state-of-the-art methods in phenotypic prediction tasks.
The seqLens models provide a robust foundation for optimizing DNA language models and advancing genome annotation across diverse biological contexts.

- **Developed by:** omicseye
- **Model type:** Encoder
- **Language(s) (NLP):** DNA
- **Pretraining dataset:** omicseye/prok_heavy
- **License:** The model is made available under the [CC-BY-NC 4.0 License]. For inquiries about commercial licensing, please contact [email protected].

<p align="center">
<img width="100%" src="https://github.com/omicsEye/seqLens/blob/main/visualizations/plots/png/model_performance_all.png?raw=true">
</p>

### Model Sources

- **Repository:** https://github.com/omicsEye/seqLens
- **Paper:** https://doi.org/10.1101/2025.03.12.642848

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the pretrained masked-language model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("omicseye/seqLens_esm_4096_512_55M")
model = AutoModelForMaskedLM.from_pretrained("omicseye/seqLens_esm_4096_512_55M")
```

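Once the tokenizer and model load, one common way to use an encoder like this is to turn a DNA sequence into a fixed-size embedding. The following is a minimal sketch, not the authors' documented workflow. The helper names `normalize_dna` and `embed_sequence` are hypothetical, as are the uppercase `ACGTN` alphabet, the mean-pooling strategy, and the 512-token limit (guessed from the `512` in the model name).

```python
def normalize_dna(sequence: str, alphabet: str = "ACGTN") -> str:
    """Strip whitespace, uppercase, and reject symbols outside `alphabet`.

    Hypothetical helper: the uppercase ACGTN alphabet is an assumption,
    not something the model card specifies.
    """
    cleaned = "".join(sequence.split()).upper()
    unexpected = set(cleaned) - set(alphabet)
    if unexpected:
        raise ValueError(f"Unexpected symbols in sequence: {sorted(unexpected)}")
    return cleaned


def embed_sequence(
    sequence: str,
    model_name: str = "omicseye/seqLens_esm_4096_512_55M",
    max_length: int = 512,  # assumption: inferred from "512" in the model name
):
    """Return a mean-pooled embedding (1-D tensor) for a single DNA sequence."""
    # Heavy imports stay local so normalize_dna can be used without them.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(
        normalize_dna(sequence),
        return_tensors="pt",
        return_attention_mask=True,
        truncation=True,
        max_length=max_length,
    )
    with torch.no_grad():
        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            output_hidden_states=True,
        )

    # Last layer hidden states have shape (1, num_tokens, hidden_size).
    last_hidden = outputs.hidden_states[-1]
    # Single unpadded sequence: average over the token dimension.
    return last_hidden[0].mean(dim=0)
```

Calling `embed_sequence("ATGCGTACGTTAGC")` downloads the checkpoint on first use and returns one vector per sequence, which can then feed a downstream classifier. Mean pooling is only one choice; pooling on a special token or fine-tuning a classification head are alternatives.
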
## Citation

```bibtex
@article{seqLens,
  author = {Baghbanzadeh, Mahdi and Mann, Brendan and Crandall, Keith A and Rahnavard, Ali},
  title = {seqLens: optimizing language models for genomic predictions},
  elocation-id = {2025.03.12.642848},
  year = {2025},
  doi = {10.1101/2025.03.12.642848},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848},
  eprint = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848.full.pdf},
  journal = {bioRxiv}
}
```