@@ -1,14 +1,14 @@
 ---
 library_name: transformers
 tags:
-- bulk RNA-seq
-- biology
-- transcriptomics
 ---
 # BulkRNABert
-BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq data using self-supervision via masked language modeling, following BERT’s method.  It can be further fine-tuned for cancer type classification and survival time prediction on the TCGA dataset.
 **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
@@ -29,23 +29,47 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 pip install torch
 ```
-A small snippet of code is provided below to run inference with the model using random input.
 ```
-import torch
-from transformers import AutoConfig, AutoModel
 model = AutoModel.from_pretrained(
     "InstaDeepAI/BulkRNABert",
     trust_remote_code=True,
 )
-n_genes = model.config.n_genes
-dummy_gene_expressions = torch.randint(0, model.config.n_expressions_bins, (1, n_genes))
-torch_output = model(dummy_gene_expressions)
-```
-A more complete example is provided in the repository.
 ### Citing our work

 ---
 library_name: transformers
 tags:
+  - bulk RNA-seq
+  - biology
+  - transcriptomics
 ---
 # BulkRNABert
+BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq profiles from the TCGA dataset using self-supervised masked language modeling, following the original BERT framework. The model is trained to reconstruct randomly masked gene expression values from their genomic context, enabling it to learn biologically meaningful representations of transcriptomic profiles. Once pre-trained, BulkRNABert can be fine-tuned for various cancer-related downstream tasks—such as cancer type classification or survival analysis—by extracting embeddings from the model.
 **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
 pip install torch
 ```
+## Other notes
+We also provide the params for the BulkRNABert jax model in `jax_params`.
+A small snippet of code is provided below to run inference with the model using bulk RNA-seq samples from the [TCGA](https://portal.gdc.cancer.gov/) dataset.
 ```
+from huggingface_hub import hf_hub_download
+import numpy as np
+import pandas as pd
+from transformers import AutoConfig, AutoModel, AutoTokenizer
+# Load model and tokenizer.
+config = AutoConfig.from_pretrained(
+    "InstaDeepAI/BulkRNABert",
+    trust_remote_code=True,
+)
+config.embeddings_layers_to_save = (4,)  # last transformer layer
+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/BulkRNABert", trust_remote_code=True)
 model = AutoModel.from_pretrained(
     "InstaDeepAI/BulkRNABert",
+    config=config,
     trust_remote_code=True,
 )
+# Load the bulk RNA-seq data and preprocess it.
+csv_path = hf_hub_download(
+    repo_id="InstaDeepAI/BulkRNABert",
+    filename="data/tcga_sample.csv",
+    repo_type="model",
+)
+gene_expression_array = pd.read_csv(csv_path).drop(["identifier"], axis=1).to_numpy()[:1, :]
+gene_expression_array = np.log10(1 + gene_expression_array)
+assert gene_expression_array.shape[1] == config.n_genes
+# Tokenize
+gene_expression_ids = tokenizer.batch_encode_plus(gene_expression_array, return_tensors="pt")["input_ids"]
+# Compute BulkRNABert's embeddings
+gene_expression_mean_embeddings = model(gene_expression_ids)["embeddings_4"].mean(axis=1)  # embeddings can be used for downstream tasks.
+```
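The preprocessing and pooling steps in the snippet above can be illustrated without downloading the model. The following is a minimal sketch, not code from the repository: it uses random NumPy arrays as stand-ins, the sizes (2 samples, 5 genes, 3 embedding dimensions) are arbitrary, and it assumes the per-layer embeddings have shape `(batch, n_genes, embedding_dim)`, which is consistent with averaging over `axis=1` but is not stated in the model card.

```python
import numpy as np

# Hypothetical stand-in data: 2 samples, 5 genes (sizes chosen for illustration only).
rng = np.random.default_rng(0)
raw_expression = rng.uniform(0.0, 1000.0, size=(2, 5))

# Same transform as in the snippet above: log10(1 + x) compresses the dynamic range
# and maps an expression value of 0 to 0.
log_expression = np.log10(1 + raw_expression)

# Stand-in for model(...)["embeddings_4"]; the (batch, n_genes, dim) layout is an assumption.
token_embeddings = rng.normal(size=(2, 5, 3))

# Averaging over the gene axis yields one fixed-size vector per sample.
sample_embeddings = token_embeddings.mean(axis=1)

print(log_expression.shape)     # one transformed value per (sample, gene)
print(sample_embeddings.shape)  # one embedding vector per sample
```

Mean pooling is one simple way to obtain a per-sample representation; a downstream classifier or survival model would then take `sample_embeddings` as input features.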
 ### Citing our work