nferruz committed
Commit eff9c6d
1 Parent(s): 5f81a43

Update README.md

Files changed (1):
  1. README.md (+5 -5)
README.md CHANGED
@@ -35,13 +35,13 @@ output = model.generate(input_ids, top_k=8, repetition_penalty=1.2, max_length=1
  eos_token_id=1,pad_token_id=0,do_sample=True, num_return_sequences=100)
  ```
 
- **Example 2: Finetuning on a set of user-defined sequences**
+ **Example 2: Fine-tuning on a set of user-defined sequences**
 
  This alternative to zero-shot generation permits further improving the model's confidence for EC numbers with few members. User-defined training and validation files containing the sequences of interest are provided to the model. After a short update of the model's weights, ZymCTRL will generate sequences that follow the input properties. This might not be necessary in cases where the model has already seen many sequences per EC class.
 
  To create the validation and training files, it is necessary to
  (1) remove the FASTA headers for each sequence,
- (2) prepare the sequences in the format: `EC number<sep><start>S E Q U E N C E<end><|endoftext|>` and (3) split the originating dataset into training and validation files (this is often done with the ratio 90/10, 80/20 or 95/5). Then, to finetune the model to the input sequences, we can use the example below. Here we show a learning rate of 1e-06, but ideally, the learning rate should be optimised in separate runs. After training, the finetuned model will be stored in the ./output folder. Lastly, ZymCTRL can generate the tailored sequences as shown in Example 1:
+ (2) prepare the sequences in the format: `EC number<sep><start>S E Q U E N C E<end><|endoftext|>`, and (3) split the originating dataset into training and validation files (this is often done with a 90/10, 80/20 or 95/5 ratio). Then, to fine-tune the model on the input sequences, we can use the example below. Here we show a learning rate of 1e-06, but ideally the learning rate should be optimised in separate runs. After training, the fine-tuned model will be stored in the ./output folder. Lastly, ZymCTRL can generate the tailored sequences as shown in Example 1:
 
  ```
  python run_clm.py --model_name_or_path nferruz/ZymCTRL --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ZymCTRL
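
As a rough illustration of steps (1)-(3) above, the sketch below converts a FASTA file into `training.txt` and `validation.txt` in the expected format. The input file name `enzymes.fasta`, the EC label, and the exact 90/10 split are assumptions for this example, not values taken from the README:

```python
# Illustrative sketch (assumed file names and EC label): build training.txt and
# validation.txt in the ZymCTRL format from a plain FASTA file.
import random

ec_label = "1.1.1.1"        # EC number used as the control tag (assumption)
sequences, current = [], []

with open("enzymes.fasta") as handle:        # hypothetical input file
    for line in handle:
        line = line.strip()
        if line.startswith(">"):             # (1) drop FASTA headers
            if current:
                sequences.append("".join(current))
                current = []
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current))

# (2) format each sequence as: EC number<sep><start>S E Q U E N C E<end><|endoftext|>
records = [f"{ec_label}<sep><start>{' '.join(seq)}<end><|endoftext|>" for seq in sequences]

# (3) split into training and validation files (90/10 here)
random.shuffle(records)
cut = int(0.9 * len(records))
with open("training.txt", "w") as f:
    f.write("\n".join(records[:cut]) + "\n")
with open("validation.txt", "w") as f:
    f.write("\n".join(records[cut:]) + "\n")
```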
@@ -52,7 +52,7 @@ The HuggingFace script run_clm.py can be found here: https://github.com/huggingf
 
  ### **How to select the best sequences**
 
- First of all, we recommend selecting only sequences where the padding token has been emitted. Because the generation occurs with a max_length parameter, Hugging Face generation function will truncate sequences that surpassed that length. Once the sequence has been emitted, select those with at least one <pad> token at the end. Otherwise you might be seeing truncated sequences by the length limit.
+ First, we recommend selecting only sequences where the padding token has been emitted. Because generation occurs with a max_length parameter, the HuggingFace generation function will truncate sequences that surpass that length. Once the sequences have been emitted, select those with at least one <pad> token at the end; otherwise, you might be keeping sequences that were truncated at the length limit.
 
  Besides, we've observed that perplexity values correlate with AlphaFold2's pLDDT.
  We recommend computing perplexity for each sequence as follows:
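
The body of `calculatePerplexity` falls outside the lines shown in this diff; as a minimal sketch, a standard perplexity computation for a causal language model might look like the following (the actual implementation in the README may differ):

```python
# Minimal sketch of a perplexity function for a causal LM; the README's actual
# implementation may differ.
import math
import torch

def calculatePerplexity(sequence, model, tokenizer):
    input_ids = torch.tensor(tokenizer.encode(sequence)).unsqueeze(0).to(model.device)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    # outputs.loss is the mean negative log-likelihood per token
    return math.exp(outputs.loss.item())
```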
@@ -67,7 +67,7 @@ def calculatePerplexity(sequence, model, tokenizer):
  # Generate sequences by loading model and tokenizer (previously downloaded)
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer') # replace with the actual path
- model = GPT2LMHeadModel.from_pretrained('/path/to/output').to(device)
+ model = GPT2LMHeadModel.from_pretrained('/path/to/ZymCTRL').to(device)
  output = model.generate("1.1.1.1", max_length=400, do_sample=True, top_k=8, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)
 
  # Take (for example) the first sequence
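
Following the selection advice above, a hedged sketch for keeping only non-truncated generations could look like this; it assumes `output` holds the generated token ids and that the pad token id is 0, as in Example 1:

```python
# Hedged sketch: keep only generations that end in the pad token (id 0 here,
# as in Example 1) and strip the control tokens to recover the raw sequence.
pad_id = 0
kept = []
for ids in output:
    ids = ids.tolist()
    if ids[-1] != pad_id:
        continue                                         # likely truncated at max_length
    text = tokenizer.decode(ids)
    text = text.split("<start>")[-1].split("<end>")[0]   # keep only the sequence span
    kept.append(text.replace(" ", ""))                   # remove the inter-residue spaces
```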
@@ -76,7 +76,7 @@ ppl = calculatePerplexity(sequence, model, tokenizer)
  ```
 
  Where `ppl` is the perplexity value for that sequence.
- We do not yet have a threshold as of what perplexity value gives a 'good' or 'bad' sequence, but given the fast inference times, the best is to sample many sequences, order them by perplexity, and select those with the lower values (the lower the better).
+ We do not yet have a threshold for what perplexity value gives a 'good' or 'bad' sequence, but given the fast inference times, it is best to sample many sequences, order them by perplexity, and select those with the lowest values (the lower the better).
 
 
  ### **Training specs**
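
To act on the perplexity-ranking recommendation in the hunk above, one possible sketch, reusing the hypothetical `kept` list, `model`, `tokenizer`, and `calculatePerplexity` from the earlier sketches, is:

```python
# Sketch: score each retained sequence and keep the lowest-perplexity ones.
scored = sorted((calculatePerplexity(seq, model, tokenizer), seq) for seq in kept)
best_sequences = [seq for _, seq in scored[:10]]   # e.g. keep the 10 best
```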
 