Noelia Ferruz committed
Commit 4325144
1 Parent(s): f85dcf3

Update README.md

Files changed (1)
1. README.md (+18 -0)
README.md CHANGED
@@ -52,6 +52,24 @@ python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.tx
```
The HuggingFace script run_clm.py can be found here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py

+ ### **How to select the best sequences**
+ We've observed that perplexity values correlate with AlphaFold2's pLDDT scores. This plot shows perplexity vs. pLDDT values for each of the 10,000 sequences in the ProtGPT2-generated dataset:
+ <div align="center">
+ <img src="https://huggingface.co/nferruz/ProtGPT2/resolve/main/ppl-plddt.png" width="45%" />
+ </div>
+
+ We recommend computing the perplexity of each sequence with the HuggingFace `evaluate` method `perplexity`:
+
+ ```
+ from evaluate import load
+
+ # Load the perplexity metric and score each generated sequence with ProtGPT2
+ perplexity = load("perplexity", module_type="metric")
+ results = perplexity.compute(predictions=predictions, model_id='nferruz/ProtGPT2')
+ ```
+
+ Here, `predictions` is a list containing the generated sequences.
+ As a rule of thumb, sequences with perplexity values below 72 are more likely to have pLDDT values in line with those of natural sequences.
+
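+ A minimal sketch of applying this cutoff, assuming the metric's `compute` output exposes a `"perplexities"` list with one score per input sequence (as in the `evaluate` documentation):
+
+ ```
+ from evaluate import load
+
+ perplexity = load("perplexity", module_type="metric")
+ results = perplexity.compute(predictions=predictions, model_id='nferruz/ProtGPT2')
+
+ # Keep only the sequences that fall below the rule-of-thumb threshold of 72
+ selected = [seq for seq, ppl in zip(predictions, results["perplexities"]) if ppl < 72]
+ ```
+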


### **Training specs**