mgelard committed · Commit ae29c1b · verified
1 Parent(s): 769787f

Update README.md

Files changed (1)
  1. README.md +36 -12
README.md CHANGED
@@ -1,14 +1,14 @@
  ---
  library_name: transformers
  tags:
- - bulk RNA-seq
- - biology
- - transcriptomics
  ---
 
  # BulkRNABert
 
- BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq data using self-supervision via masked language modeling, following BERT’s method. It can be further fine-tuned for cancer type classification and survival time prediction on the TCGA dataset.
 
  **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
 
@@ -29,23 +29,47 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
  pip install torch
  ```
 
- A small snippet of code is provided below to run inference with the model using random input.
 
  ```
- import torch
- from transformers import AutoConfig, AutoModel
 
  model = AutoModel.from_pretrained(
      "InstaDeepAI/BulkRNABert",
      trust_remote_code=True,
  )
 
- n_genes = model.config.n_genes
- dummy_gene_expressions = torch.randint(0, model.config.n_expressions_bins, (1, n_genes))
- torch_output = model(dummy_gene_expressions)
- ```
 
- A more complete example is provided in the repository.
 
 
  ### Citing our work
 
  ---
  library_name: transformers
  tags:
+ - bulk RNA-seq
+ - biology
+ - transcriptomics
  ---
 
  # BulkRNABert
 
+ BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq profiles from the TCGA dataset using self-supervised masked language modeling, following the original BERT framework. The model is trained to reconstruct randomly masked gene expression values from their genomic context, enabling it to learn biologically meaningful representations of transcriptomic profiles. Once pre-trained, BulkRNABert can be fine-tuned for various cancer-related downstream tasks—such as cancer type classification or survival analysis—by extracting embeddings from the model.
 
  **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
 

  pip install torch
  ```
 
+ ## Other notes
+ We also provide the params for the BulkRNABert jax model in `jax_params`.
+
+ A small snippet of code is provided below to run inference with the model using bulk RNA-seq samples from the [TCGA](https://portal.gdc.cancer.gov/) dataset.
 
  ```
+ from huggingface_hub import hf_hub_download
+ import numpy as np
+ import pandas as pd
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
+
+ # Load model and tokenizer.
+ config = AutoConfig.from_pretrained(
+     "InstaDeepAI/BulkRNABert",
+     trust_remote_code=True,
+ )
+ config.embeddings_layers_to_save = (4,) # last transformer layer
 
+ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/BulkRNABert", trust_remote_code=True)
  model = AutoModel.from_pretrained(
      "InstaDeepAI/BulkRNABert",
+     config=config,
      trust_remote_code=True,
  )
 
+ # Load bulk RNA-seq data and preprocess them.
+ csv_path = hf_hub_download(
+     repo_id="InstaDeepAI/BulkRNABert",
+     filename="data/tcga_sample.csv",
+     repo_type="model",
+ )
+ gene_expression_array = pd.read_csv(csv_path).drop(["identifier"], axis=1).to_numpy()[:1, :]
+ gene_expression_array = np.log10(1 + gene_expression_array)
+ assert gene_expression_array.shape[1] == config.n_genes
 
+ # Tokenize
+ gene_expression_ids = tokenizer.batch_encode_plus(gene_expression_array, return_tensors="pt")["input_ids"]
+
+ # Compute BulkRNABert's embeddings
+ gene_expression_mean_embeddings = model(gene_expression_ids)["embeddings_4"].mean(axis=1) # embeddings can be used for downstream tasks.
+ ```
 
 
  ### Citing our work
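
Note on the change: the removed snippet drew random token ids in the range `[0, n_expressions_bins)`, while the added snippet starts from real expression values, applies `log10(1 + x)`, and lets the tokenizer produce the ids. The sketch below illustrates how those two views might connect. It is not the repository's tokenizer: uniform binning, the helper names `log_transform` and `to_bin_ids`, the `max_value` parameter, and the choice of 64 bins are all assumptions made for illustration.

```python
import math


def log_transform(raw_counts):
    """Apply log10(1 + x), the preprocessing step used in the added snippet."""
    return [math.log10(1.0 + x) for x in raw_counts]


def to_bin_ids(values, n_bins, max_value):
    """Map each value to an integer bin id in [0, n_bins - 1].

    Uniform bins over [0, max_value] are an assumption for illustration;
    the real tokenizer defines its own binning.
    """
    bin_width = max_value / n_bins
    # Clamp so that values at or above max_value fall into the last bin.
    return [min(int(v / bin_width), n_bins - 1) for v in values]


# Hypothetical expression values for four genes.
raw_counts = [0.0, 9.0, 99.0, 999.0]
log_values = log_transform(raw_counts)
bin_ids = to_bin_ids(log_values, n_bins=64, max_value=4.0)
```

In practice, the ids should come from `AutoTokenizer` as in the added snippet, since the model was pre-trained with that tokenizer's binning.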