caduceus-ph_seqlen-131k_d_model-256_n_layer-16

	---
	library_name: transformers
	license: apache-2.0
	---

	## Using Caduceus
	To use the pre-trained model for masked language modeling, use the following snippet:
	```python
	from transformers import AutoModelForMaskedLM, AutoTokenizer

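	# See the `Caduceus` collection page on the hub for a list of available models.
	model_name = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"
	# Caduceus ships its model and tokenizer as custom code on the hub,
	# so loading requires `trust_remote_code=True`.
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)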
	```

	Alternatively, you can instantiate a model from scratch to train on your own data as follows:
	```python
	from transformers import AutoConfig, AutoModelForMaskedLM

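	# Add any config overrides here, see the `config.json` file on the hub for details.
	config_overrides = {}
	# See the `Caduceus` collection page on the hub for a list of available models.
	# Caduceus ships its config and model as custom code on the hub,
	# so loading requires `trust_remote_code=True`.
	config = AutoConfig.from_pretrained(
	    "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16",
	    trust_remote_code=True,
	    **config_overrides,
	)
	# Builds the architecture with randomly initialized weights (no pre-trained checkpoint).
	model = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)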
	```

	## Model Details

	This is the Caduceus-Ph model with hidden dimension 256 and 16 MambaDNA layers.
	This model is not inherently reverse complement (RC) equivariant.
	Rather, it was pre-trained using RC data augmentation.
	Its intended usage is as follows: for downstream tasks, the model should be trained with RC data augmentation.
	At inference time on a downstream task, the model should be run twice: once on a sequence and once on its RC.
	The outputs of these two runs should then be averaged to form the downstream task prediction.
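
The recipe above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `reverse_complement`, `rc_augment`, and `predict_rc_averaged` are hypothetical helper names, and `predict_fn` is an assumed stand-in for the full tokenizer + model + task-head pipeline that returns one sequence-level prediction vector.

```python
import random
from typing import Callable, List, Sequence

# Standard DNA complement pairs; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def rc_augment(seq: str, rng: random.Random, p: float = 0.5) -> str:
    """Training-time augmentation: swap in the RC with probability p."""
    return reverse_complement(seq) if rng.random() < p else seq


def predict_rc_averaged(
    predict_fn: Callable[[str], Sequence[float]], seq: str
) -> List[float]:
    """Inference: run on the sequence and on its RC, then average element-wise.

    `predict_fn` is a hypothetical placeholder for the real model call.
    """
    forward = predict_fn(seq)
    reverse = predict_fn(reverse_complement(seq))
    return [(f + r) / 2.0 for f, r in zip(forward, reverse)]


if __name__ == "__main__":
    # Toy, strand-asymmetric "model": fraction of A bases in the input.
    def toy_predict(s: str) -> List[float]:
        return [s.count("A") / len(s)]

    rng = random.Random(0)
    print(rc_augment("AAAC", rng))
    print(predict_rc_averaged(toy_predict, "AAAC"))
```

By construction, the averaged prediction is the same whether the input is a sequence or its reverse complement, which is the property the two-pass inference is meant to recover. The sketch assumes sequence-level outputs; per-position outputs would presumably need to be re-aligned before averaging, which is not covered here.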

	This model was pre-trained on the human reference genome with sequence length 131,072 for 50k steps (each step contained ~1M base pairs / tokens).
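
As a rough sanity check on those figures, the arithmetic below estimates the batch size in sequences and the total number of tokens seen during pre-training. The card only gives "~1M" tokens per step; treating that as exactly `2**20` is an assumption made here for illustration.

```python
# Figures quoted in the model card.
seq_len = 131_072      # tokens (base pairs) per sequence, i.e. 2**17
num_steps = 50_000     # pre-training steps

# Assumption: "~1M" tokens per step is taken to be exactly 2**20.
tokens_per_step = 2**20

sequences_per_step = tokens_per_step // seq_len
total_tokens = num_steps * tokens_per_step

print(f"sequences per step: {sequences_per_step}")
print(f"total tokens seen:  {total_tokens:.2e}")
```

Under that assumption each step holds 8 full-length sequences, and pre-training covers on the order of 50 billion tokens.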

	For more details, please see our paper: [Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling](https://arxiv.org/abs/2403.03234).

	## Citation

	Please cite our work using the BibTeX entry below:
	```
	@article{schiff2024caduceus,
	  title={Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling},
	  author={Schiff, Yair and Kao, Chia-Hsiang and Gokaslan, Aaron and Dao, Tri and Gu, Albert and Kuleshov, Volodymyr},
	  journal={arXiv preprint arXiv:2403.03234},
	  year={2024}
	}
	```

	## Model Card Contact

	Yair Schiff ([email protected])