ligeti committed
Commit ccc8eb9 · verified · 1 Parent(s): f1bb9ca

Update README.md

Files changed (1): README.md +52 -2
README.md CHANGED
@@ -1,7 +1,43 @@
+ ---
+ license: cc-by-nc-4.0
+ ---
+
+
  ## ProkBERT-mini Model
 
  ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. Optimized for DNA sequence analysis, it provides robust, high-resolution solutions.
 
+ ## Simple Usage Example
+
+ The following example demonstrates how to use the ProkBERT-mini model to process a DNA sequence:
+
+ ```python
+ from transformers import MegatronBertForMaskedLM
+ from prokbert.prokbert_tokenizer import ProkBERTTokenizer
+
+ # Tokenization parameters: overlapping 6-mers generated with a shift of 1 base
+ tokenization_parameters = {
+     'kmer': 6,
+     'shift': 1
+ }
+
+ # Initialize the tokenizer and model
+ tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters, operation_space='sequence')
+ model = MegatronBertForMaskedLM.from_pretrained("nerualbioinfo/prokbert-mini-k6s1")
+
+ # Example DNA sequence
+ sequence = 'ATGTCCGCGGGACCT'
+
+ # Tokenize the sequence
+ inputs = tokenizer(sequence, return_tensors="pt")
+
+ # Ensure that inputs have a batch dimension
+ inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
+
+ # Generate outputs from the model
+ outputs = model(**inputs)
+ ```
+
  ### Model Details
 
  **Developed by:** Neural Bioinformatics Research Group
@@ -109,6 +145,7 @@ try:
  except ImportError:
      !pip install prokbert
      print("Installed ProkBERT.")
+ ```
 
  ### Training Data and Process
 
@@ -139,5 +176,18 @@ Please report any issues with the model or its outputs to the Neural Bioinformat
  - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
  - **Feedback and inquiries:** [[email protected]](mailto:[email protected])
 
- ---
- license: cc-by-nc-4.0
+ ## Reference
+
+ If you use ProkBERT-mini in your research, please cite the following paper:
+ @ARTICLE{10.3389/fmicb.2023.1331233,
+   AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
+   TITLE={ProkBERT family: genomic language models for microbiome applications},
+   JOURNAL={Frontiers in Microbiology},
+   VOLUME={14},
+   YEAR={2024},
+   URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
+   DOI={10.3389/fmicb.2023.1331233},
+   ISSN={1664-302X},
+   ABSTRACT={...}
+ }
+
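
The `kmer` and `shift` values passed to `ProkBERTTokenizer` in the usage example above control how a DNA sequence is cut into overlapping k-mers: with `kmer=6` and `shift=1`, a 6-base window slides along the sequence one base at a time. The plain-Python sketch below illustrates only that windowing step, using a hypothetical `kmerize` helper that is not part of the prokbert package; the real tokenizer additionally maps the k-mers to token IDs and, presumably, adds special tokens.

```python
# Hypothetical helper, for illustration only; it is not part of the prokbert package.
def kmerize(sequence: str, kmer: int = 6, shift: int = 1) -> list[str]:
    """Cut a DNA sequence into overlapping k-mers, advancing `shift` bases per step."""
    return [sequence[i:i + kmer] for i in range(0, len(sequence) - kmer + 1, shift)]

print(kmerize('ATGTCCGCGGGACCT', kmer=6, shift=1))
# ['ATGTCC', 'TGTCCG', 'GTCCGC', 'TCCGCG', 'CCGCGG',
#  'CGCGGG', 'GCGGGA', 'CGGGAC', 'GGGACC', 'GGACCT']
```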
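The usage example ends at `outputs = model(**inputs)` without inspecting the result. The sketch below shows one way to look at the raw predictions, assuming the standard Hugging Face `MaskedLMOutput` convention in which `outputs.logits` has shape (batch size, sequence length, vocabulary size); it is an illustrative continuation, not code from the commit. For genuine masked-token prediction, positions in `inputs['input_ids']` would first be replaced with the tokenizer's mask token ID, assuming one is defined as is usual for masked language models.

```python
import torch

# Run the forward pass without tracking gradients (inference only).
with torch.no_grad():
    outputs = model(**inputs)

# Per-position scores over the vocabulary for the single input sequence.
logits = outputs.logits
print(logits.shape)  # expected: (1, sequence_length, vocab_size)

# Highest-scoring vocabulary ID at every position.
predicted_ids = logits[0].argmax(dim=-1)
print(predicted_ids)
```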