Model Card for Jamba-DNA-v1-134M-hg38 (Jamba for DNA)
The Jamba-DNA-v1-134M-hg38 Large Language Model (LLM) is a pretrained generative DNA sequence model with 134M parameters. It is derived from the Jamba model, simplified for DNA: the number of layers and the hidden size were reduced. The model was pretrained on 100 kb DNA sequences from the hg38 human genome assembly. In comparison, the Mistral DNA models were trained on 10 kb DNA sequences, so they can only handle a much smaller genomic sequence context.
Model Architecture
Jamba is a state-of-the-art, hybrid SSM-Transformer LLM. It is the first production-scale Mamba implementation, which opens up interesting research and application opportunities. Jamba is a pretrained, mixture-of-experts (MoE) generative text model.
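A quick way to see how the architecture was reduced is to inspect the checkpoint's configuration. This is a minimal sketch; the attribute names below (num_hidden_layers, hidden_size) follow the standard Jamba configuration in Transformers and are assumed to match what this checkpoint's remote code exposes.
from transformers import AutoConfig
# Minimal sketch: inspect the simplified architecture of this checkpoint.
# Attribute names assume the standard Jamba config; the remote code may differ.
config = AutoConfig.from_pretrained("RaphaelMourad/Jamba-DNA-v1-134M-hg38", trust_remote_code=True)
print(config.num_hidden_layers)  # reduced number of layers
print(config.hidden_size)        # reduced hidden size (the 256-dim hidden states used below)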
Load the model from Hugging Face:
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Jamba-DNA-v1-134M-hg38", trust_remote_code=True)
model = AutoModel.from_pretrained("RaphaelMourad/Jamba-DNA-v1-134M-hg38", trust_remote_code=True)
Calculate the embedding of a DNA sequence
DNAseq = "TGATGATTGGCGCGGCTAGGATCGGCT"
inputs = tokenizer(DNAseq, return_tensors = 'pt')["input_ids"]
hidden_states = model(inputs)[0] # [1, sequence_length, 256]
# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expected: torch.Size([256])
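Max pooling is one option; mean pooling over the token dimension is a common alternative. A short sketch under the same assumptions as above:
# embedding with mean pooling over the token dimension
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expected: torch.Size([256])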
Troubleshooting
Ensure you are utilizing a stable version of Transformers, 4.39.0 or newer.
To run the optimized Mamba implementations, you first need to install mamba-ssm and causal-conv1d (quoting the version specifier so the shell does not interpret >= as a redirection):
pip install mamba-ssm "causal-conv1d>=1.2.0"
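If you are unsure whether the optimized kernels are installed in your environment, a quick import check (a minimal sketch; package import names assumed to be mamba_ssm and causal_conv1d) can confirm it:
import importlib.util
# Report whether the optimized Mamba kernel packages are importable.
for pkg in ("mamba_ssm", "causal_conv1d"):
    print(pkg, "available" if importlib.util.find_spec(pkg) else "missing")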
Notice
Jamba-DNA-v1-134M-hg38 is a pretrained base model for DNA.
Contact
Raphaël Mourad. [email protected]