ZhiyuanChen committed
Commit e75766e
1 Parent(s): cb8ffda
Update README.md
README.md CHANGED
@@ -10,19 +10,19 @@ library_name: multimolecule
 pipeline_tag: fill-mask
 mask_token: "<mask>"
 widget:
-  - example_title: "PRNP"
-    text: "…
+  - example_title: "Homo sapiens PRNP mRNA for prion"
+    text: "AGC<mask>CAUUAUGGCGAACCUUGGCUGCUG"
     output:
-      - label: "…
-        score: 0.…
-      - label: "…
-        score: 0.…
-      - label: "…
-        score: 0.…
-      - label: "…
-        score: 0.…
-      - label: "…
-        score: 0.…
+      - label: "AAA"
+        score: 0.05433480441570282
+      - label: "AUC"
+        score: 0.04437034949660301
+      - label: "AAU"
+        score: 0.03882088139653206
+      - label: "ACA"
+        score: 0.037016965448856354
+      - label: "ACC"
+        score: 0.03563101962208748
 ---
 
 # mRNA-FM
@@ -94,7 +94,7 @@ RNA-FM is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-style m
 - **Paper**: [Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions](https://doi.org/10.1101/2022.08.06.503062)
 - **Developed by**: Jiayang Chen, Zhihang Hu, Siqi Sun, Qingxiong Tan, Yixuan Wang, Qinze Yu, Licheng Zong, Liang Hong, Jin Xiao, Tao Shen, Irwin King, Yu Li
 - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ESM](https://huggingface.co/facebook/esm2_t48_15B_UR50D)
-- **Original Repository**: […
+- **Original Repository**: [ml4bio/RNA-FM](https://github.com/ml4bio/RNA-FM)
 
 ## Usage
 
@@ -111,29 +111,29 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> import multimolecule  # you must import multimolecule to register models
 >>> from transformers import pipeline
->>> unmasker = pipeline(…
->>> unmasker("…
+>>> unmasker = pipeline("fill-mask", model="multimolecule/mrnafm")
+>>> unmasker("agc<mask>cauuauggcgaaccuuggcugcug")
 
-[{'score': 0.…
-  'token': …
-  'token_str': '…
-  'sequence': '…
- {'score': 0.…
-  'token': …
-  'token_str': '…
-  'sequence': '…
- {'score': 0.…
-  'token': …
-  'token_str': '…
-  'sequence': '…
- {'score': 0.…
-  'token': …
-  'token_str': '…
-  'sequence': '…
- {'score': 0.…
-  'token': …
-  'token_str': '…
-  'sequence': '…
+[{'score': 0.05433480441570282,
+  'token': 6,
+  'token_str': 'AAA',
+  'sequence': 'AGC AAA CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.04437034949660301,
+  'token': 22,
+  'token_str': 'AUC',
+  'sequence': 'AGC AUC CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.03882088139653206,
+  'token': 9,
+  'token_str': 'AAU',
+  'sequence': 'AGC AAU CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.037016965448856354,
+  'token': 11,
+  'token_str': 'ACA',
+  'sequence': 'AGC ACA CAU UAU GGC GAA CCU UGG CUG CUG'},
+ {'score': 0.03563101962208748,
+  'token': 12,
+  'token_str': 'ACC',
+  'sequence': 'AGC ACC CAU UAU GGC GAA CCU UGG CUG CUG'}]
 ```
 
 ### Downstream Use
@@ -146,11 +146,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
 from multimolecule import RnaTokenizer, RnaFmModel
 
 
-tokenizer = RnaTokenizer.from_pretrained(…
-model = RnaFmModel.from_pretrained(…
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmModel.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=…
+input = tokenizer(text, return_tensors="pt")
 
 output = model(**input)
 ```
@@ -166,17 +166,17 @@ import torch
 from multimolecule import RnaTokenizer, RnaFmForSequencePrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(…
-model = RnaFmForSequencePrediction.from_pretrained(…
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmForSequencePrediction.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=…
+input = tokenizer(text, return_tensors="pt")
 label = torch.tensor([1])
 
 output = model(**input, labels=label)
 ```
 
-#### …
+#### Token Classification / Regression
 
 **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
 
@@ -184,14 +184,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
 
 ```python
 import torch
-from multimolecule import RnaTokenizer, …
+from multimolecule import RnaTokenizer, RnaFmForTokenPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(…
-model = …
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmForTokenPrediction.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=…
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), ))
 
 output = model(**input, labels=label)
@@ -208,11 +208,11 @@ import torch
 from multimolecule import RnaTokenizer, RnaFmForContactPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(…
-model = RnaFmForContactPrediction.from_pretrained(…
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/mrnafm")
+model = RnaFmForContactPrediction.from_pretrained("multimolecule/mrnafm")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=…
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), len(text)))
 
 output = model(**input, labels=label)