---
license: cc-by-nc-sa-4.0
widget:
- text: >-
    AGTCCAGTGGACGACCAGCCACGGCTCCGGTCTGTAGAACCATCGCGGAAACGGCTCGCAAAACTCTAAACAGCGCAAACGATGCGCGCGCCGAAGCAACCCGGCTCTACTTATAAAAACGTCCAACGGTGAGCACCGAGCAGCTACTACTCGTACTCCCCCCACCGATC
tags:
- DNA
- biology
- genomics
datasets:
- zhangtaolab/plant-multi-species-promoter-strength
metrics:
- r_squared
base_model:
- zhangtaolab/plant-nucleotide-transformer-BPE
---

# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All the models have comparable sizes, between 90 MB and 150 MB. They use a BPE tokenizer with a vocabulary of 8000 tokens.

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()

### Architecture

The model is based on the InstaDeepAI/nucleotide-transformer-v2-100m-multi-species model, with a modified tokenizer that replaces k-mer tokenization with BPE.

This model is fine-tuned to predict promoter strength in a maize protoplast system.
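
To make the tokenizer change concrete, the toy sketch below contrasts fixed-length k-mer splitting with byte-pair encoding (BPE) on DNA strings. It is an illustration only, not the model's tokenizer: the real vocabulary has 8000 tokens learned from plant genomes, whereas here the merges are learned from a tiny made-up corpus. The function names, the non-overlapping reading of "k-mer", and `k=6` are all assumptions chosen for the example.

```python
from collections import Counter


def kmer_tokenize(seq, k=6):
    """Split a sequence into non-overlapping k-mers (the last chunk may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single nucleotides."""
    tokenized = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in tokenized:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken alphabetically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        tokenized = [apply_merge(tokens, best) for tokens in tokenized]
    return merges


def bpe_tokenize(seq, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy corpus (hypothetical sequences, for illustration only).
toy_corpus = ["TATATAAAGGCC", "GGCCTATATAAA", "AAATATAGGCC"]
toy_merges = learn_bpe_merges(toy_corpus, num_merges=5)

example = "TATAAAGGCCTATA"
print("k-mer:", kmer_tokenize(example))
print("BPE:  ", bpe_tokenize(example, toy_merges))
```

In this toy setting, k-mer tokens have a fixed length regardless of content, while BPE tokens vary in length because frequent sub-sequences are merged into single tokens.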
### How to use

Install the runtime library first:
```bash
pip install transformers
```

Here is a simple example for inference:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'plant-nucleotide-transformer-BPE-promoter_strength_protoplast'
# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# inference
sequences = ['TACTCTAATCGTATCAGCTGCACTTGCGTACAGGCTACCGGCGTCCTCAGCCACGTAAGAAAAGGCCCAATAAAGGCCCAACTACAACCAGCGGATATATATACTGGAGCCTGGCGAGATCACCCTAACCCCTCACACTCCCATCCAGCCGCCACCAGGTGCAGAGTGTT',
             'ATTTCAAAACTAGTTTTCTATAAACGAAAACTTATATTTATTCCGCTTGTTCCGTTTGATCTGCTGATTCGACACCGTTTTAACGTATTTTAAGTAAGTATCAGAAATATTAATGTGAAGATAAAAGAAAATAGAGTAAATGTAAAGGAAAATGCATAAGATTTTGTTGA']
# function_to_apply="none" returns the raw model output instead of a softmax/sigmoid score
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, function_to_apply="none")
results = pipe(sequences)
print(results)
```

### Training data

We use EsmForSequenceClassification to fine-tune the model.
The detailed training procedure is described in our manuscript.

#### Hardware

The model was trained on an NVIDIA GTX1080Ti GPU (11 GB).
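
As a follow-up to the inference example in "How to use", the sketch below shows one way to pair each input sequence with its predicted value. It assumes that `results` is the usual `text-classification` pipeline output, a list of dictionaries with `label` and `score` keys, and that with `function_to_apply="none"` the `score` is the raw regression output. The helper names, the `LABEL_0` label, and the numeric values in `mock_results` are hypothetical placeholders, not real model predictions.

```python
def collect_predictions(sequences, results):
    """Pair each input sequence with the raw score returned by the pipeline."""
    if len(sequences) != len(results):
        raise ValueError("sequences and results must have the same length")
    return [
        {
            "sequence": seq,
            "length": len(seq),
            "predicted_strength": float(res["score"]),
        }
        for seq, res in zip(sequences, results)
    ]


def rank_by_strength(predictions):
    """Sort predictions from the highest to the lowest predicted strength."""
    return sorted(predictions, key=lambda p: p["predicted_strength"], reverse=True)


# Placeholder inputs and outputs so the helpers can be tried without the model.
# The scores are made up for illustration and are NOT real predictions.
mock_sequences = ["TACTCTAATCGTATCAGC", "ATTTCAAAACTAGTTTTC"]
mock_results = [
    {"label": "LABEL_0", "score": 0.5},
    {"label": "LABEL_0", "score": 1.5},
]

for item in rank_by_strength(collect_predictions(mock_sequences, mock_results)):
    print(item["predicted_strength"], item["length"], item["sequence"])
```

With the real pipeline, `collect_predictions(sequences, results)` can be called directly on the variables from the inference example.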