# GenerRNA
GenerRNA is a generative RNA language model based on a Transformer decoder-only architecture. It was pre-trained on 30M sequences, encompassing 17B nucleotides.
Here, you can find all the relevant scripts for running GenerRNA on your machine. GenerRNA enables you to generate RNA sequences in a zero-shot manner to explore the RNA space, or to fine-tune the model on a specific dataset to generate RNAs belonging to a particular family or possessing specific characteristics.
# Requirements
A CUDA environment and a minimum of 8 GB of VRAM are required.
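A quick way to confirm that a suitable GPU is visible to PyTorch (a minimal sketch; adjust the device index if you have more than one GPU):
```
import torch

# Check that a CUDA device is available and report its memory.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 8:
        print("Warning: less than 8 GB of VRAM; generation/training may fail.")
else:
    print("No CUDA device detected.")
```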
### Dependencies
```
torch>=2.0
numpy
transformers==4.33.0.dev0
datasets==2.14.4
tqdm
```
# Usage
First, recombine the split model checkpoint using the command `cat model.pt.part-* > model.pt.recombined`.
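If the concatenation succeeded, the recombined file should load as an ordinary PyTorch checkpoint. A minimal sanity check (run on CPU, so no GPU is needed for this step; the exact contents of the checkpoint depend on the release):
```
import torch

# Load the recombined checkpoint on CPU just to verify the parts were joined correctly.
ckpt = torch.load("model.pt.recombined", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # only confirms the file deserializes; keys vary by release
```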
#### Directory tree
```
.
├── LICENSE
├── README.md
├── configs
│   ├── example_finetuning.py
│   └── example_pretraining.py
├── experiments_data
├── model.pt.part-aa       # split binary data of the *HISTORICAL* model (shorter context window, lower VRAM consumption)
├── model.pt.part-ab
├── model.pt.part-ac
├── model.pt.part-ad
├── model_updated.pt       # *NEWER* model, with a longer context window, trained on a deduplicated dataset
├── model.py               # defines the architecture
├── sampling.py            # script to generate sequences
├── tokenization.py        # prepares data
├── tokenizer_bpe_1024
│   ├── tokenizer.json
│   └── ...
└── train.py               # script for training/fine-tuning
```
### De novo Generation in a zero-shot fashion
Usage example:
```
python sampling.py \
--out_path {output_file_path} \
--max_new_tokens 256 \
--ckpt_path {model.pt} \
  --tokenizer_path {path_to_tokenizer_directory, e.g. ./tokenizer_bpe_1024}
```
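`sampling.py` handles model loading and decoding for you. If you want to embed generation in your own code, the core of it is ordinary autoregressive sampling, sketched below. This is illustrative only: it pairs the repository's tokenizer with a randomly initialized stand-in module, because rebuilding the GPT and loading the real checkpoint is the job of `model.py`/`sampling.py`; the start token id is likewise a placeholder.
```
import torch
import torch.nn as nn
from transformers import PreTrainedTokenizerFast

# Tokenizer shipped with the repository (path relative to the repo root).
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer_bpe_1024/tokenizer.json")

# Stand-in model: a real run would rebuild the GPT defined in model.py and load
# model.pt / model_updated.pt. This dummy module only exists so the sketch runs.
class DummyLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):  # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.emb(ids))

model = DummyLM(tokenizer.vocab_size).eval()

@torch.no_grad()
def sample(ids: torch.Tensor, max_new_tokens: int = 32, temperature: float = 1.0) -> torch.Tensor:
    """Plain autoregressive sampling: predict the next token, sample it, append, repeat."""
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

start = torch.tensor([[0]])  # placeholder start token; a real run would encode a prompt
print(tokenizer.decode(sample(start)[0].tolist()))  # random output with the dummy model
```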
### Pre-training or Fine-tuning on your own sequences
First, tokenize your sequence data, ensuring each sequence is on a separate line and there is no header.
```
python tokenization.py \
--data_dir {path_to_the_directory_containing_sequence_data} \
--file_name {file_name_of_sequence_data} \
--tokenizer_path {path_to_tokenizer_directory} \
--out_dir {directory_to_save_tokenized_data} \
--block_size 256
```
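If your sequences are stored as FASTA, a small helper like the following (file names are hypothetical) produces the header-free, one-sequence-per-line input expected above:
```
# Convert a FASTA file into one sequence per line, with no headers,
# which is the input format expected by tokenization.py.
def fasta_to_lines(fasta_path: str, out_path: str) -> None:
    with open(fasta_path) as fin, open(out_path, "w") as fout:
        seq = []
        for line in fin:
            line = line.strip()
            if line.startswith(">"):          # header line: flush the previous record
                if seq:
                    fout.write("".join(seq) + "\n")
                    seq = []
            elif line:
                seq.append(line)
        if seq:                                # flush the last record
            fout.write("".join(seq) + "\n")

fasta_to_lines("my_rnas.fasta", "sequences.txt")  # hypothetical file names
```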
Next, refer to `./configs/example_**.py` to create a config file for the GPT model.
Lastly, execute the following command:
```
python train.py \
--config {path_to_your_config_file}
```
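For orientation, a config is a plain Python file that assigns training hyperparameters. The names below are hypothetical placeholders, not the authoritative keys; copy `configs/example_pretraining.py` or `configs/example_finetuning.py` and edit it rather than starting from this sketch.
```
# Hypothetical config sketch: the real parameter names are defined in
# configs/example_pretraining.py / example_finetuning.py; start from those.
out_dir = "out_rna_finetune"        # where checkpoints are written (assumed name)
data_dir = "tokenized_data"         # output directory of tokenization.py (assumed name)
block_size = 256                    # should match --block_size used at tokenization
batch_size = 16
learning_rate = 3e-4
max_iters = 10_000
```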
### Train your own tokenizer
Usage example:
```
python train_BPE.py \
  --txt_file_path {path_to_training_file (txt, one sequence per line)} \
  --vocab_size 50256 \
  --new_tokenizer_path {directory_to_save_trained_tokenizer}
```
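A tokenizer of this kind can be trained with the Hugging Face `tokenizers` library; the following is a minimal sketch of that approach under assumed settings (input file name, special tokens), not necessarily identical to what `train_BPE.py` does:
```
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Train a byte-pair-encoding tokenizer on one-sequence-per-line text.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=50256, special_tokens=["<unk>"])
tokenizer.train(files=["sequences.txt"], trainer=trainer)  # hypothetical input file
tokenizer.save("tokenizer.json")
```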
# License
The source code is licensed under the MIT License. See `LICENSE`.