---
library_name: transformers
license: apache-2.0
---

## Using Caduceus

To use the pre-trained model for masked language modeling, use the following snippet:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# See the `Caduceus` collection page on the hub for the list of available models.
model_name = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"

# Caduceus is not a built-in `transformers` architecture, so loading it most
# likely requires `trust_remote_code=True` (this runs code from the hub repo).
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
```

Alternatively, you can instantiate a model from scratch to train on your own data as follows:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Add any config overrides here; see the `config.json` file on the hub for details.
config_overrides = {}

# See the `Caduceus` collection page on the hub for the list of available models.
# `trust_remote_code=True` is most likely required because the config and model
# classes are defined in the hub repo rather than in `transformers`.
config = AutoConfig.from_pretrained(
    "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16",
    trust_remote_code=True,
    **config_overrides,
)

# Builds the architecture with randomly initialized weights (no checkpoint is loaded).
model = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)
```

## Model Details

This is the Caduceus-Ph model with hidden dimension 256 and 16 MambaDNA layers.
This model is not inherently reverse-complement (RC) equivariant.
Instead, it was pre-trained with RC data augmentation.
Its intended usage is as follows: for downstream tasks, the model should be fine-tuned with RC data augmentation.
At downstream task inference, the model should be run twice: once on a sequence and once on its RC.
The outputs of these two runs should be averaged to form the downstream task prediction.

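The inference procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `predict_fn` is a hypothetical stand-in for whatever wraps the tokenizer and the fine-tuned model, and it is assumed to return one sequence-level vector (for example, class logits). `toy_predict` exists only so the sketch runs without a model.

```python
from typing import Callable, List, Sequence

# Complement table for DNA bases; "N" (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_averaged_prediction(
    predict_fn: Callable[[str], Sequence[float]], seq: str
) -> List[float]:
    """Average sequence-level predictions over a sequence and its RC.

    `predict_fn` is a hypothetical callable (tokenizer + fine-tuned model).
    """
    forward = predict_fn(seq)
    reverse = predict_fn(reverse_complement(seq))
    return [(f + r) / 2.0 for f, r in zip(forward, reverse)]


# Toy stand-in for a model, used only to make the sketch runnable.
def toy_predict(seq: str) -> List[float]:
    return [float(seq.count("A")), float(seq.count("C"))]


print(reverse_complement("AAGC"))                   # GCTT
print(rc_averaged_prediction(toy_predict, "AAGC"))  # [1.0, 1.0]
```

For per-position outputs (one prediction per base), the RC run's output would presumably also need to be reversed along the sequence axis before averaging so that positions line up; the sketch covers only the sequence-level case.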

This model was pre-trained on the human reference genome with sequence length 131,072 for 50k steps (each step contained ~1M base pairs / tokens).

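As a rough sanity check on these numbers, the arithmetic below estimates the sequences per step and the total tokens seen during pre-training. It assumes that "~1M" means 2**20 = 1,048,576 tokens per step, which is not stated in this card; with a different exact value the totals shift slightly.

```python
SEQ_LEN = 131_072          # sequence length stated above (2**17)
NUM_STEPS = 50_000         # "50k steps"
TOKENS_PER_STEP = 2**20    # ASSUMPTION: "~1M" interpreted as 1,048,576

sequences_per_step = TOKENS_PER_STEP // SEQ_LEN
total_tokens = NUM_STEPS * TOKENS_PER_STEP

print(f"sequences per step: {sequences_per_step}")   # 8
print(f"total tokens seen:  {total_tokens:,}")       # 52,428,800,000
```

Under that assumption, each step holds 8 full-length sequences and pre-training covers roughly 50 billion base pairs in total.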

For more details, please see our paper: [Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling](https://arxiv.org/abs/2403.03234).

## Citation

Please cite our work using the BibTeX entry below:

**BibTeX:**

```
@article{schiff2024caduceus,
  title={Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling},
  author={Schiff, Yair and Kao, Chia-Hsiang and Gokaslan, Aaron and Dao, Tri and Gu, Albert and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2403.03234},
  year={2024}
}
```

## Model Card Contact

Yair Schiff ([email protected])