ZhiyuanChen committed on
Commit e75766e
1 Parent(s): cb8ffda

Update README.md

Files changed (1)
  1. README.md +50 -50
README.md CHANGED
@@ -10,19 +10,19 @@ library_name: multimolecule
 pipeline_tag: fill-mask
 mask_token: "<mask>"
 widget:
-  - example_title: "PRNP"
-    text: "CTG<mask>AAGCGGCCCACGCGGACTGACGGGCGGGGG"
+  - example_title: "Homo sapiens PRNP mRNA for prion"
+    text: "AGC<mask>CAUUAUGGCGAACCUUGGCUGCUG"
     output:
-      - label: "GGC"
-        score: 0.09496457129716873
-      - label: "GAG"
-        score: 0.09480331838130951
-      - label: "GAC"
-        score: 0.07397700101137161
-      - label: "AAG"
-        score: 0.07375374436378479
-      - label: "GUG"
-        score: 0.06565868109464645
+      - label: "AAA"
+        score: 0.05433480441570282
+      - label: "AUC"
+        score: 0.04437034949660301
+      - label: "AAU"
+        score: 0.03882088139653206
+      - label: "ACA"
+        score: 0.037016965448856354
+      - label: "ACC"
+        score: 0.03563101962208748
 ---
 
 # mRNA-FM
@@ -94,7 +94,7 @@ RNA-FM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style m
 - **Paper**: [Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions](https://doi.org/10.1101/2022.08.06.503062)
 - **Developed by**: Jiayang Chen, Zhihang Hu, Siqi Sun, Qingxiong Tan, Yixuan Wang, Qinze Yu, Licheng Zong, Liang Hong, Jin Xiao, Tao Shen, Irwin King, Yu Li
 - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ESM](https://huggingface.co/facebook/esm2_t48_15B_UR50D)
-- **Original Repository**: [https://github.com/ml4bio/RNA-FM](https://github.com/ml4bio/RNA-FM)
+- **Original Repository**: [ml4bio/RNA-FM](https://github.com/ml4bio/RNA-FM)
 
 ## Usage
 
@@ -111,29 +111,29 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> import multimolecule # you must import multimolecule to register models
 >>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='multimolecule/mrnafm')
->>> unmasker("ctg<mask>aagcggcccacgcggactgacgggcggggg")
-
-[{'score': 0.09496457129716873,
-  'token': 67,
-  'token_str': 'GGC',
-  'sequence': 'CUG GGC AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
- {'score': 0.09480331838130951,
-  'token': 58,
-  'token_str': 'GAG',
-  'sequence': 'CUG GAG AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
- {'score': 0.07397700101137161,
-  'token': 57,
-  'token_str': 'GAC',
-  'sequence': 'CUG GAC AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
- {'score': 0.07375374436378479,
-  'token': 8,
-  'token_str': 'AAG',
-  'sequence': 'CUG AAG AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'},
- {'score': 0.06565868109464645,
-  'token': 73,
-  'token_str': 'GUG',
-  'sequence': 'CUG GUG AAG CGG CCC ACG CGG ACU GAC GGG CGG GGG'}]
+>>> unmasker = pipeline("fill-mask", model="multimolecule/mrnafm")
+>>> unmasker("agc<mask>cauuauggcgaaccuuggcugcug")
+
+[{'score': 0.05433480441570282,
+  'token': 6,
+  'token_str': 'AAA',
+  'sequence': 'AGC AAA CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.04437034949660301,
+  'token': 22,
+  'token_str': 'AUC',
+  'sequence': 'AGC AUC CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.03882088139653206,
+  'token': 9,
+  'token_str': 'AAU',
+  'sequence': 'AGC AAU CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.037016965448856354,
+  'token': 11,
+  'token_str': 'ACA',
+  'sequence': 'AGC ACA CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.03563101962208748,
+  'token': 12,
+  'token_str': 'ACC',
+  'sequence': 'AGC ACC CAU UAU GGC GAA CCU UGG CUG CUG'}]
 ```
 
 ### Downstream Use
@@ -146,11 +146,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
 from multimolecule import RnaTokenizer, RnaFmModel
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/mrnafm')
-model = RnaFmModel.from_pretrained('multimolecule/mrnafm')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmModel.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 
 output = model(**input)
 ```
@@ -166,17 +166,17 @@ import torch
 from multimolecule import RnaTokenizer, RnaFmForSequencePrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/mrnafm')
-model = RnaFmForSequencePrediction.from_pretrained('multimolecule/mrnafm')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmForSequencePrediction.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 label = torch.tensor([1])
 
 output = model(**input, labels=label)
 ```
 
-#### Nucleotide Classification / Regression
+#### Token Classification / Regression
 
 **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
 
@@ -184,14 +184,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
 
 ```python
 import torch
-from multimolecule import RnaTokenizer, RnaFmForNucleotidePrediction
+from multimolecule import RnaTokenizer, RnaFmForTokenPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/mrnafm')
-model = RnaFmForNucleotidePrediction.from_pretrained('multimolecule/mrnafm')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmForTokenPrediction.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), ))
 
 output = model(**input, labels=label)
@@ -208,11 +208,11 @@ import torch
 from multimolecule import RnaTokenizer, RnaFmForContactPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained('multimolecule/mrnafm')
-model = RnaFmForContactPrediction.from_pretrained('multimolecule/mrnafm')
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmForContactPrediction.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors='pt')
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), len(text)))
 
 output = model(**input, labels=label)
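
Note on the fill-mask example in this diff: the `sequence` fields in the pipeline output suggest that the input is upper-cased, that `T` is mapped to `U`, and that the string is split into three-nucleotide (codon) tokens around `<mask>`. The sketch below reproduces that observable behaviour in plain Python. It is an illustration only: `to_codon_tokens` is a hypothetical helper, not part of `multimolecule`, and it is not claimed to match how `RnaTokenizer` is implemented.

```python
from typing import List


def to_codon_tokens(sequence: str, mask_token: str = "<mask>") -> List[str]:
    """Split a nucleotide string into 3-mer tokens, keeping the mask token whole.

    Hypothetical helper for illustration; assumes every unmasked stretch has a
    length that is a multiple of three.
    """
    tokens: List[str] = []
    # Split around the mask token first so it is never upper-cased or cut apart.
    parts = sequence.split(mask_token)
    for i, part in enumerate(parts):
        # Normalise case and map DNA thymine (T) to RNA uracil (U).
        part = part.upper().replace("T", "U")
        # Walk the stretch in steps of three nucleotides.
        tokens.extend(part[j:j + 3] for j in range(0, len(part), 3))
        # Re-insert the mask token between consecutive parts.
        if i < len(parts) - 1:
            tokens.append(mask_token)
    return tokens


tokens = to_codon_tokens("agc<mask>cauuauggcgaaccuuggcugcug")
# Filling the mask with the top prediction from the example ("AAA")
filled = " ".join("AAA" if t == "<mask>" else t for t in tokens)
print(filled)  # AGC AAA CAU UAU GGC GAA CCU UGG CUG CUG
```

The printed string matches the first `sequence` entry in the updated example; the same helper applied to the previous DNA-style input (`ctg<mask>aagcgg...`) yields the `CUG ... ACU ...` tokens shown on the removed lines.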