ligeti committed
Commit ccc8eb9 · verified · 1 Parent(s): f1bb9ca

Update README.md

Files changed (1): README.md +52 -2
README.md CHANGED
@@ -1,7 +1,43 @@
+ ---
+ license: cc-by-nc-4.0
+ ---
+
+
  ## ProkBERT-mini Model
 
  ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. Optimized for DNA sequence analysis, it provides robust, high-resolution solutions.
 
+ ## Simple Usage Example
+
+ The following example demonstrates how to use the ProkBERT-mini model to process a DNA sequence:
+
+ ```python
+ from transformers import MegatronBertForMaskedLM
+ from prokbert.prokbert_tokenizer import ProkBERTTokenizer
+
+ # Tokenization parameters: overlapping 6-mers generated with a shift of 1 base
+ tokenization_parameters = {
+     'kmer': 6,
+     'shift': 1
+ }
+
+ # Initialize the tokenizer and model
+ tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters, operation_space='sequence')
+ model = MegatronBertForMaskedLM.from_pretrained("nerualbioinfo/prokbert-mini-k6s1")
+
+ # Example DNA sequence
+ sequence = 'ATGTCCGCGGGACCT'
+
+ # Tokenize the sequence
+ inputs = tokenizer(sequence, return_tensors="pt")
+
+ # Ensure that inputs have a batch dimension
+ inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
+
+ # Generate outputs from the model
+ outputs = model(**inputs)
+ ```
+
  ### Model Details
 
  **Developed by:** Neural Bioinformatics Research Group
@@ -109,6 +145,7 @@ try:
  except ImportError:
      !pip install prokbert
      print("Installed ProkBERT.")
+ ```
 
  ### Training Data and Process
 
@@ -139,5 +176,18 @@ Please report any issues with the model or its outputs to the Neural Bioinformat
  - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
  - **Feedback and inquiries:** [[email protected]](mailto:[email protected])
 
- ---
- license: cc-by-nc-4.0
+ ## Reference
+
+ If you use ProkBERT-mini in your research, please cite the following paper:
+ @ARTICLE{10.3389/fmicb.2023.1331233,
+   AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
+   TITLE={ProkBERT family: genomic language models for microbiome applications},
+   JOURNAL={Frontiers in Microbiology},
+   VOLUME={14},
+   YEAR={2024},
+   URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
+   DOI={10.3389/fmicb.2023.1331233},
+   ISSN={1664-302X},
+   ABSTRACT={...}
+ }
+
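
The `kmer` and `shift` values passed to `ProkBERTTokenizer` in the usage example above control how a DNA sequence is cut into overlapping k-mers: with `kmer=6` and `shift=1`, a 6-base window slides along the sequence one base at a time. The plain-Python sketch below illustrates only that windowing step, using a hypothetical `kmerize` helper that is not part of the prokbert package; the real tokenizer additionally maps the k-mers to token IDs and, presumably, adds special tokens.

```python
# Hypothetical helper, for illustration only; it is not part of the prokbert package.
def kmerize(sequence: str, kmer: int = 6, shift: int = 1) -> list[str]:
    """Cut a DNA sequence into overlapping k-mers, advancing `shift` bases per step."""
    return [sequence[i:i + kmer] for i in range(0, len(sequence) - kmer + 1, shift)]

print(kmerize('ATGTCCGCGGGACCT', kmer=6, shift=1))
# ['ATGTCC', 'TGTCCG', 'GTCCGC', 'TCCGCG', 'CCGCGG',
#  'CGCGGG', 'GCGGGA', 'CGGGAC', 'GGGACC', 'GGACCT']
```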
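The usage example ends at `outputs = model(**inputs)` without inspecting the result. The sketch below shows one way to look at the raw predictions, assuming the standard Hugging Face `MaskedLMOutput` convention in which `outputs.logits` has shape (batch size, sequence length, vocabulary size); it is an illustrative continuation, not code from the commit. For genuine masked-token prediction, positions in `inputs['input_ids']` would first be replaced with the tokenizer's mask token ID, assuming one is defined as is usual for masked language models.

```python
import torch

# Run the forward pass without tracking gradients (inference only).
with torch.no_grad():
    outputs = model(**inputs)

# Per-position scores over the vocabulary for the single input sequence.
logits = outputs.logits
print(logits.shape)  # expected: (1, sequence_length, vocab_size)

# Highest-scoring vocabulary ID at every position.
predicted_ids = logits[0].argmax(dim=-1)
print(predicted_ids)
```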