Update README.md
README.md
CHANGED
@@ -1,14 +1,14 @@
 ---
 library_name: transformers
 tags:
-- bulk RNA-seq
-- biology
-- transcriptomics
 ---

 # BulkRNABert

-BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq

 **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

@@ -29,23 +29,47 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 pip install torch
 ```

-

 ```
-import
-

 model = AutoModel.from_pretrained(
     "InstaDeepAI/BulkRNABert",
     trust_remote_code=True,
 )

-
-
-
-

-


 ### Citing our work
@@ -1,14 +1,14 @@
 ---
 library_name: transformers
 tags:
+- bulk RNA-seq
+- biology
+- transcriptomics
 ---

 # BulkRNABert

+BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq profiles from the TCGA dataset using self-supervised masked language modeling, following the original BERT framework. The model is trained to reconstruct randomly masked gene expression values from their genomic context, enabling it to learn biologically meaningful representations of transcriptomic profiles. Once pre-trained, BulkRNABert can be fine-tuned for various cancer-related downstream tasks, such as cancer type classification or survival analysis, by extracting embeddings from the model.
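
The masking step described above can be illustrated with a minimal, standard-library sketch. It is not the training code behind this model: `MASK_ID`, the 15% masking rate (BERT's usual default), and the toy token ids are assumptions made for illustration only.

```python
import random

MASK_ID = -1          # hypothetical placeholder id for a masked position
MASK_FRACTION = 0.15  # assumed masking rate (BERT's default), not taken from this README


def mask_tokens(token_ids, mask_fraction=MASK_FRACTION, seed=0):
    """Return (masked_ids, targets) for one tokenized expression profile.

    targets[i] holds the original id at masked positions and None elsewhere,
    so a reconstruction loss would only be computed where a value was hidden.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(mask_fraction * len(token_ids)))
    masked_positions = set(rng.sample(range(len(token_ids)), n_masked))
    masked_ids = [MASK_ID if i in masked_positions else t for i, t in enumerate(token_ids)]
    targets = [t if i in masked_positions else None for i, t in enumerate(token_ids)]
    return masked_ids, targets


# Toy profile: 20 genes whose expression values are already mapped to integer ids.
profile = [3, 7, 0, 12, 5, 9, 1, 4, 8, 2, 6, 11, 10, 13, 3, 7, 0, 5, 9, 1]
masked, targets = mask_tokens(profile)
```

During pre-training, a model would receive `masked` as input and be scored only on the positions where `targets` is not `None`.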

 **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

@@ -29,23 +29,47 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 pip install torch
 ```

+## Other notes
+We also provide the parameters for the BulkRNABert JAX model in `jax_params`.
+
+The snippet below runs inference with the model on bulk RNA-seq samples from the [TCGA](https://portal.gdc.cancer.gov/) dataset.

 ```
+from huggingface_hub import hf_hub_download
+import numpy as np
+import pandas as pd
+from transformers import AutoConfig, AutoModel, AutoTokenizer
+
+# Load model and tokenizer.
+config = AutoConfig.from_pretrained(
+    "InstaDeepAI/BulkRNABert",
+    trust_remote_code=True,
+)
+config.embeddings_layers_to_save = (4,)  # last transformer layer

+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/BulkRNABert", trust_remote_code=True)
 model = AutoModel.from_pretrained(
     "InstaDeepAI/BulkRNABert",
+    config=config,
     trust_remote_code=True,
 )

+# Load bulk RNA-seq data and preprocess it.
+csv_path = hf_hub_download(
+    repo_id="InstaDeepAI/BulkRNABert",
+    filename="data/tcga_sample.csv",
+    repo_type="model",
+)
+gene_expression_array = pd.read_csv(csv_path).drop(["identifier"], axis=1).to_numpy()[:1, :]
+gene_expression_array = np.log10(1 + gene_expression_array)
+assert gene_expression_array.shape[1] == config.n_genes

+# Tokenize
+gene_expression_ids = tokenizer.batch_encode_plus(gene_expression_array, return_tensors="pt")["input_ids"]
+
+# Compute BulkRNABert's embeddings
+gene_expression_mean_embeddings = model(gene_expression_ids)["embeddings_4"].mean(axis=1)  # embeddings can be used for downstream tasks.
+```
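
As a rough illustration of how mean embeddings such as `gene_expression_mean_embeddings` might feed a downstream task like cancer type classification, the sketch below fits a nearest-centroid classifier. This is a stand-in technique chosen for brevity, not the authors' fine-tuning procedure. The helper names, the labels `type_a` and `type_b`, and the 3-dimensional toy vectors are all hypothetical; real embeddings would come from the model and be much wider.

```python
import math
from collections import defaultdict


def fit_centroids(embeddings, labels):
    """Average the embedding vectors of each class into one centroid per label."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Toy 3-dimensional vectors standing in for real mean embeddings (illustrative only).
train_embeddings = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.2], [-1.0, 0.8, 0.1], [-0.8, 1.2, -0.1]]
train_labels = ["type_a", "type_a", "type_b", "type_b"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = predict(centroids, [1.0, 0.0, 0.1])
```

In practice one would convert the model output to plain lists or arrays first and would likely prefer a trained classification head; the centroid approach only shows that the embeddings can be consumed as ordinary feature vectors.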


 ### Citing our work