# GenerRNA

GenerRNA is a generative RNA language model based on a Transformer decoder-only architecture. It was pre-trained on 30M sequences, encompassing 17B nucleotides.

Here you can find all the relevant scripts for running GenerRNA on your machine. GenerRNA enables you to generate RNA sequences in a zero-shot manner to explore the RNA space, or to fine-tune the model on a specific dataset to generate RNAs belonging to a particular family or possessing specific characteristics.
# Requirements

A CUDA environment is required, with a minimum of 8 GB VRAM.

### Dependencies

```
torch>=2.0
numpy
transformers==4.33.0.dev0
datasets==2.14.4
tqdm
```
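These dependencies can typically be installed with pip. Note that `transformers==4.33.0.dev0` is a development build, so it would need to be installed from a source checkout of that version; a nearby stable 4.33.x release may also work, but that is an assumption rather than something verified here:

```
pip install "torch>=2.0" numpy "datasets==2.14.4" tqdm
# the transformers pin is a dev build; a nearby stable release is one possible substitute
pip install "transformers>=4.33,<4.34"
```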
# Usage

First, combine the split model files using the command `cat model.pt.part-* > model.pt.recombined`.
#### Directory tree

```
.
├── LICENSE
├── README.md
├── configs
│   ├── example_finetuning.py
│   └── example_pretraining.py
├── experiments_data
├── model.pt.part-aa       # split binary data of the *HISTORICAL* model (shorter context window, lower VRAM consumption)
├── model.pt.part-ab
├── model.pt.part-ac
├── model.pt.part-ad
├── model_updated.pt       # *NEWER* model, with a longer context window, trained on a deduplicated dataset
├── model.py               # defines the architecture
├── sampling.py            # script to generate sequences
├── tokenization.py        # prepares data
├── tokenizer_bpe_1024
│   ├── tokenizer.json
│   └── ....
└── train.py               # script for training/fine-tuning
```
### De novo Generation in a zero-shot fashion

Usage example:

```
python sampling.py \
    --out_path {output_file_path} \
    --max_new_tokens 256 \
    --ckpt_path {model.pt} \
    --tokenizer_path {path_to_tokenizer_directory, e.g. /tokenizer_bpe_1024}
```
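For instance, using the recombined historical model (or `model_updated.pt`) and the bundled tokenizer from the repository root, with an arbitrary output path:

```
python sampling.py \
    --out_path ./generated_sequences.txt \
    --max_new_tokens 256 \
    --ckpt_path ./model.pt.recombined \
    --tokenizer_path ./tokenizer_bpe_1024
```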
### Pre-training or Fine-tuning on your own sequences

First, tokenize your sequence data, ensuring each sequence is on a separate line and there is no header.
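A minimal input file would look like this (hypothetical sequences, shown in the RNA alphabet; use whatever alphabet your own data is in):

```
GGGAAACGUCGAUCGGCUAAUCGGUAGCUAGCUAGGAUCCGAUC
AUGGCUAGCUUAGGCUUAGCGAUCGAUCGAUGCUAGCUAGCAAU
CGCGGAUCCUAGCUAGCUAGGAUCCGAUUAGCGAUCGAUCGAUC
```

Then tokenize it: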
```
python tokenization.py \
    --data_dir {path_to_the_directory_containing_sequence_data} \
    --file_name {file_name_of_sequence_data} \
    --tokenizer_path {path_to_tokenizer_directory} \
    --out_dir {directory_to_save_tokenized_data} \
    --block_size 256
```
Next, refer to `./configs/example_**.py` to create a config file for the GPT model.
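As a rough illustration only, a fine-tuning config might look like the sketch below; the variable names assume a nanoGPT-style config and are not guaranteed to match this repository, so copy `./configs/example_finetuning.py` and edit it rather than writing a config from scratch:

```
# Hypothetical sketch -- the authoritative option names are in ./configs/example_finetuning.py
out_dir = 'out_rna_finetune'       # where checkpoints will be written
data_dir = 'data/my_rna_family'    # directory produced by tokenization.py
init_from = 'resume'               # continue from an existing checkpoint instead of training from scratch
batch_size = 16
block_size = 256                   # should match --block_size used during tokenization
learning_rate = 1e-5
max_iters = 10000
```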
Lastly, execute the following command:

```
python train.py \
    --config {path_to_your_config_file}
```
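For example, to fine-tune with the provided example config (after editing it for your own data):

```
python train.py --config ./configs/example_finetuning.py
```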
### Train your own tokenizer

Usage example:

```
python train_BPE.py \
    --txt_file_path {path_to_training_file (txt, each sequence on a separate line)} \
    --vocab_size 50256 \
    --new_tokenizer_path {directory_to_save_trained_tokenizer}
```
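For instance, to train a smaller tokenizer comparable to the bundled `tokenizer_bpe_1024` (the 1024 vocabulary size is inferred from its name, and the file paths below are placeholders):

```
python train_BPE.py \
    --txt_file_path ./my_sequences.txt \
    --vocab_size 1024 \
    --new_tokenizer_path ./my_tokenizer_bpe_1024
```

The resulting directory can then be passed as `--tokenizer_path` to `tokenization.py` and `sampling.py`.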
# License

The source code is licensed under the MIT License. See `LICENSE`.