Commit bf78fda by littleworth: "Update README.md" (parent: 653e8ed)

README.md (changed)

The distilled model, `protgpt2-distilled-tiny`, exhibits a significant improvement …
![Evals](https://images.mobilism.org/?di=PYFQ1N5V)
### Usage

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextGenerationPipeline

# Load the model and tokenizer
model_name = "littleworth/protgpt2-distilled-tiny"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Ensure the tokenizer pads from the left
tokenizer.padding_side = "left"

# Initialize the pipeline (device=0 is the first GPU; use device=-1 for CPU)
text_generator = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=0)

# Generate sequences
sequences = text_generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,  # set pad_token_id to eos_token_id
    eos_token_id=0,
    truncation=True,
)

# Strip the special token plus newlines and any other non-alphabetical
# characters, then print each sequence as a FASTA-style record
for i, seq in enumerate(sequences):
    text = seq["generated_text"].replace("<|endoftext|>", "")
    text = "".join(char for char in text if char.isalpha())
    print(f">Seq_{i}")
    print(text)
```
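The loop above prints FASTA-style records to stdout. A minimal sketch for writing them to a file instead; the `write_fasta` helper and the placeholder sequences are ours for illustration, not part of `transformers`:

```python
def write_fasta(sequences, path):
    """Write plain amino-acid strings as FASTA records named >Seq_0, >Seq_1, ..."""
    with open(path, "w") as handle:
        for i, seq in enumerate(sequences):
            handle.write(f">Seq_{i}\n{seq}\n")

# Placeholder sequences standing in for the cleaned model output
write_fasta(["MKTAYIAKQR", "GLSDGEWQLV"], "generated.fasta")
```

The resulting file can be fed directly to downstream tools (alignment, structure prediction) that accept FASTA input.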

### Use Cases

1. **High-Throughput Screening in Drug Discovery:** The distilled ProtGPT2 is ideal for rapid screening of mutation effects in protein sequences within pharmaceutical research. For example, it can quickly predict the stability of protein variants in large datasets, speeding up the identification of viable drug targets.
2. **Portable Diagnostics in Healthcare:** This model is suitable for use in handheld diagnostic devices that perform real-time protein analysis in clinical settings. For instance, it can be used in portable devices to analyze blood samples for markers of diseases, providing immediate results to healthcare providers in remote areas.
3. **Interactive Learning Tools in Academia:** The distilled model can be integrated into educational software tools that allow biology students to simulate and study the impact of genetic mutations on protein structures. This hands-on learning helps students understand protein dynamics without the need for high-end computational facilities.
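
For mutation-effect screening as in the first use case, candidate variants are typically enumerated first and then scored with the model (for example, by the likelihood it assigns each sequence). A minimal sketch of the enumeration step; `point_mutants` is an illustrative helper of ours, not part of the model's API:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def point_mutants(seq):
    """Yield every single-residue substitution of seq as (position, original, new, variant)."""
    for i, orig in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != orig:
                yield i, orig, aa, seq[:i] + aa + seq[i + 1:]

# Each position has 19 possible substitutions
variants = list(point_mutants("MKT"))
```

For a sequence of length L this produces 19·L variants, which is small enough to score exhaustively with a distilled model even for long proteins.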

### References

- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
- Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13, 4348. [PMC9329459](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9329459/)