ligeti committed on
Commit
c81b104
1 Parent(s): 129dbcf

Update README.md

Files changed (1)
  1. README.md +60 -17
README.md CHANGED
@@ -5,12 +5,12 @@ license: cc-by-nc-4.0
5
 
6
  ## ProkBERT-mini Model
7
 
8
- ProkBERT-mini-k6s2 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. This model model can provide extended context size up to 4kb sequence by implementing the LCA tokenization with k-mer=6 and shift 2. This model showed comparable performance compare to other family member models.
9
 
10
 
11
  ## Simple Usage Example
12
 
13
- The following example demonstrates how to use the ProkBERT-mini model for processing a DNA sequence:
14
 
15
  ```python
16
  from transformers import MegatronBertForMaskedLM
@@ -23,7 +23,7 @@ tokenization_parameters = {
23
  }
24
  # Initialize the tokenizer and model
25
  tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters, operation_space='sequence')
26
- model = MegatronBertForMaskedLM.from_pretrained("nerualbioinfo/prokbert-mini-k6s2")
27
  # Example DNA sequence
28
  sequence = 'ATGTCCGCGGGACCT'
29
  # Tokenize the sequence
@@ -38,7 +38,7 @@ outputs = model(**inputs)
38
 
39
  **Developed by:** Neural Bioinformatics Research Group
40
 
41
- **Architecture:** ProkBERT-mini-k6s2 is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.
42
 
43
 
44
  **Tokenizer:** The model uses a 6-mer tokenizer with a shift of 2 (k6s2), specifically designed to handle DNA sequences efficiently.
@@ -47,15 +47,15 @@ outputs = model(**inputs)
47
 
48
  | Parameter | Description |
49
  |----------------------|--------------------------------------|
50
- | Model Size | 20.6 million parameters |
51
- | Max. Context Size | 1024 bp |
52
  | Training Data | 206.65 billion nucleotides |
53
  | Layers | 6 |
54
  | Attention Heads | 6 |
55
 
56
  ### Intended Use
57
 
58
- **Intended Use Cases:** ProkBERT-mini-k6-s2 is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:
59
  - Sequence classification tasks
60
  - Exploration of genomic patterns and features
61
 
@@ -109,15 +109,14 @@ except ImportError:
109
  - **Masked Language Modeling (MLM):** The MLM objective was adapted to genomic sequences by masking overlapping k-mers.
110
  - **Training Phases:** The model underwent initial training with complete sequence restoration and selective masking, followed by a subsequent phase with variable-length datasets for increased complexity.
111
 
112
- ### Evaluation Results for ProkBERT-mini
113
-
114
- | Model | L | Avg. Ref. Rank | Avg. Top1 | Avg. Top3 | Avg. AUC |
115
- |-------------------|------|----------------|-----------|-----------|-----------|
116
- | ProkBERT-mini | 128 | 0.9315 | 0.4497 | 0.8960 | 0.9998 |
117
- | ProkBERT-mini | 256 | 0.8433 | 0.4848 | 0.9130 | 0.9998 |
118
- | ProkBERT-mini | 512 | 0.8098 | 0.5056 | 0.9179 | 0.9998 |
119
- | ProkBERT-mini | 1024 | 0.7825 | 0.5169 | 0.9227 | 0.9998 |
120

121
 
122
  *Masking performance of the ProkBERT family.*
123
 
@@ -153,11 +152,55 @@ except ImportError:
153
 
154
  *Promoter prediction performance metrics on a diverse test set. A comparative analysis of various promoter prediction tools, showcasing their performance across key metrics including accuracy, F1 score, MCC, sensitivity, and specificity.*
155

156
 
157
 
158
  ### Ethical Considerations and Limitations
159
 
160
- As with all models in the bioinformatics domain, ProkBERT-mini-k6-s2 should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.
161
 
162
  ### Reporting Issues
163
 
@@ -182,4 +225,4 @@ If you use ProkBERT-mini in your research, please cite the following paper:
182
  ISSN={1664-302X},
183
  ABSTRACT={...}
184
  }
185
- ```
 
5
 
6
  ## ProkBERT-mini Model
7
 
8
+ ProkBERT-mini-long (also prokbert-mini-k6s2) is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. It provides an extended context size of up to 4 kb by implementing LCA tokenization with k-mer = 6 and shift 2, and shows performance comparable to the other models in the family.
9
 
10
 
11
  ## Simple Usage Example
12
 
13
+ The following example demonstrates how to use the ProkBERT-mini-long model for processing a DNA sequence:
14
 
15
  ```python
16
  from transformers import MegatronBertForMaskedLM
 
23
  }
24
  # Initialize the tokenizer and model
25
  tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters, operation_space='sequence')
26
+ model = MegatronBertForMaskedLM.from_pretrained("neuralbioinfo/prokbert-mini-long")
27
  # Example DNA sequence
28
  sequence = 'ATGTCCGCGGGACCT'
29
  # Tokenize the sequence
 
38
 
39
  **Developed by:** Neural Bioinformatics Research Group
40
 
41
+ **Architecture:** ProkBERT-mini-long is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.
42
 
43
 
44
  **Tokenizer:** The model uses a 6-mer tokenizer with a shift of 2 (k6s2), specifically designed to handle DNA sequences efficiently.
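
The k6s2 windowing itself is easy to picture. Below is a minimal, illustrative sketch of the sliding-window step behind LCA tokenization; the actual `ProkBERTTokenizer` additionally handles special tokens and vocabulary mapping, so this is not the library code:

```python
def kmer_tokens(seq: str, k: int = 6, shift: int = 2) -> list[str]:
    """Slide a k-base window over seq, advancing by `shift` bases per token."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, shift)]

# The 15-bp sequence from the usage example yields five overlapping 6-mers:
tokens = kmer_tokens('ATGTCCGCGGGACCT')
# → ['ATGTCC', 'GTCCGC', 'CCGCGG', 'GCGGGA', 'GGGACC']
```

With these parameters a 4096 bp window corresponds to 2,046 overlapping 6-mer tokens, which is how the k = 6, shift = 2 scheme stretches the usable context.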
 
47
 
48
  | Parameter | Description |
49
  |----------------------|--------------------------------------|
50
+ | Model Size | 26.6 million parameters |
51
+ | Max. Context Size | 4096 bp |
52
  | Training Data | 206.65 billion nucleotides |
53
  | Layers | 6 |
54
  | Attention Heads | 6 |
55
 
56
  ### Intended Use
57
 
58
+ **Intended Use Cases:** ProkBERT-mini-long is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:
59
  - Sequence classification tasks
60
  - Exploration of genomic patterns and features
61
 
 
109
  - **Masked Language Modeling (MLM):** The MLM objective was adapted to genomic sequences by masking overlapping k-mers.
110
  - **Training Phases:** The model underwent initial training with complete sequence restoration and selective masking, followed by a subsequent phase with variable-length datasets for increased complexity.
111
 
112
+ ### Evaluation Results for ProkBERT-mini-long

113
 
114
+ | Model | L | Avg. Ref. Rank | Avg. Top1 | Avg. Top3 | Avg. AUC |
115
+ |------------------------|----:|---------------:|----------:|----------:|---------:|
116
+ | `ProkBERT-mini-long` | 128 | 3.9432 | 0.2164 | 0.4781 | 0.9991 |
117
+ | `ProkBERT-mini-long` | 256 | 3.5072 | 0.2470 | 0.5258 | 0.9992 |
118
+ | `ProkBERT-mini-long` | 512 | 3.3026 | 0.2669 | 0.5435 | 0.9992 |
119
+ | `ProkBERT-mini-long` | 1024 | 3.2082 | 0.2768 | 0.5589 | 0.9992 |
120
 
121
  *Masking performance of the ProkBERT family.*
122
 
 
152
 
153
  *Promoter prediction performance metrics on a diverse test set. A comparative analysis of various promoter prediction tools, showcasing their performance across key metrics including accuracy, F1 score, MCC, sensitivity, and specificity.*
154
 
155
+ ### Evaluation on Phage Recognition Benchmark
156
+
157
+ | method | L | auc_class1 | acc | f1 | mcc | recall | sensitivity | specificity | tn | fp | fn | tp | Np | Nn | eval_time |
158
+ |:--------------|-----:|-------------:|---------:|---------:|---------:|---------:|--------------:|--------------:|-----:|-----:|-----:|-----:|------:|------:|------------:|
159
+ | DeepVirFinder | 256 | 0.734914 | 0.627163 | 0.481213 | 0.309049 | 0.345317 | 0.345317 | 0.909856 | 4542 | 450 | 3278 | 1729 | 5007 | 4992 | 7580 |
160
+ | DeepVirFinder | 512 | 0.791423 | 0.708 | 0.637717 | 0.443065 | 0.521192 | 0.521192 | 0.889722 | 4510 | 559 | 2361 | 2570 | 4931 | 5069 | 2637 |
161
+ | DeepVirFinder | 1024 | 0.826255 | 0.7424 | 0.702678 | 0.505333 | 0.605651 | 0.605651 | 0.880579 | 4380 | 594 | 1982 | 3044 | 5026 | 4974 | 1294 |
162
+ | DeepVirFinder | 2048 | 0.853098 | 0.7717 | 0.743339 | 0.557177 | 0.6612 | 0.6612 | 0.8822 | 4411 | 589 | 1694 | 3306 | 5000 | 5000 | 1351 |
163
+ | INHERIT | 256 | 0.75982 | 0.6943 | 0.67012 | 0.393179 | 0.620008 | 0.620008 | 0.76883 | 3838 | 1154 | 1903 | 3105 | 5008 | 4992 | 2131 |
164
+ | INHERIT | 512 | 0.816326 | 0.7228 | 0.651408 | 0.479323 | 0.525248 | 0.525248 | 0.914973 | 4638 | 431 | 2341 | 2590 | 4931 | 5069 | 2920 |
165
+ | INHERIT | 1024 | 0.846547 | 0.7264 | 0.659447 | 0.495935 | 0.527059 | 0.527059 | 0.927825 | 4615 | 359 | 2377 | 2649 | 5026 | 4974 | 3055 |
166
+ | INHERIT | 2048 | 0.864122 | 0.7365 | 0.668595 | 0.518541 | 0.5316 | 0.5316 | 0.9414 | 4707 | 293 | 2342 | 2658 | 5000 | 5000 | 3225 |
167
+ | MINI | 256 | 0.846745 | 0.7755 | 0.766462 | 0.552855 | 0.735623 | 0.735623 | 0.815505 | 4071 | 921 | 1324 | 3684 | 5008 | 4992 | 6.68888 |
168
+ | MINI | 512 | 0.924973 | 0.8657 | 0.859121 | 0.732696 | 0.83046 | 0.83046 | 0.89998 | 4562 | 507 | 836 | 4095 | 4931 | 5069 | 16.3681 |
169
+ | MINI | 1024 | 0.956432 | 0.9138 | 0.911189 | 0.829645 | 0.879825 | 0.879825 | 0.94813 | 4716 | 258 | 604 | 4422 | 5026 | 4974 | 51.3319 |
170
+ | MINI-C | 256 | 0.827635 | 0.7512 | 0.7207 | 0.51538 | 0.640974 | 0.640974 | 0.861779 | 4302 | 690 | 1798 | 3210 | 5008 | 4992 | 7.33697 |
171
+ | MINI-C | 512 | 0.913378 | 0.8466 | 0.834876 | 0.69725 | 0.786453 | 0.786453 | 0.905109 | 4588 | 481 | 1053 | 3878 | 4931 | 5069 | 17.6749 |
172
+ | MINI-C | 1024 | 0.94644 | 0.8937 | 0.891564 | 0.788427 | 0.869479 | 0.869479 | 0.918175 | 4567 | 407 | 656 | 4370 | 5026 | 4974 | 54.204 |
173
+ | MINI-LONG | 256 | 0.777697 | 0.71495 | 0.686224 | 0.437727 | 0.622404 | 0.622404 | 0.807792 | 8065 | 1919 | 3782 | 6234 | 10016 | 9984 | 6.10304 |
174
+ | MINI-LONG | 512 | 0.880831 | 0.81405 | 0.798001 | 0.632855 | 0.744879 | 0.744879 | 0.881338 | 8935 | 1203 | 2516 | 7346 | 9862 | 10138 | 12.1307 |
175
+ | MINI-LONG | 1024 | 0.9413 | 0.88925 | 0.884917 | 0.781465 | 0.847195 | 0.847195 | 0.931745 | 9269 | 679 | 1536 | 8516 | 10052 | 9948 | 30.5088 |
176
+ | MINI-LONG | 2048 | 0.964551 | 0.929 | 0.927455 | 0.85878 | 0.9077 | 0.9077 | 0.9503 | 9503 | 497 | 923 | 9077 | 10000 | 10000 | 94.404 |
177
+ | Virsorter2 | 512 | 0.620782 | 0.6259 | 0.394954 | 0.364831 | 0.247617 | 0.247617 | 0.993884 | 5038 | 31 | 3710 | 1221 | 4931 | 5069 | 2057 |
178
+ | Virsorter2 | 1024 | 0.719898 | 0.7178 | 0.621919 | 0.51036 | 0.461799 | 0.461799 | 0.976478 | 4857 | 117 | 2705 | 2321 | 5026 | 4974 | 3258 |
179
+ | Virsorter2 | 2048 | 0.816142 | 0.8103 | 0.778724 | 0.647532 | 0.6676 | 0.6676 | 0.953 | 4765 | 235 | 1662 | 3338 | 5000 | 5000 | 5737 |
180
+
181
+
182
+ ### Column Descriptions
183
+
184
+ - **method**: The algorithm or method used for prediction (e.g., DeepVirFinder, INHERIT).
185
+ - **L**: Length of the genomic segment.
186
+ - **auc_class1**: Area under the ROC curve for class 1, indicating the model's ability to distinguish between classes.
187
+ - **acc**: Accuracy of the prediction, representing the proportion of true results (both true positives and true negatives) among the total number of cases examined.
188
+ - **f1**: The F1 score, a measure of a test's accuracy that considers both the precision and the recall.
189
+ - **mcc**: Matthews correlation coefficient, a quality measure for binary (two-class) classifications.
190
+ - **recall**: The recall, or true positive rate, measures the proportion of actual positives that are correctly identified.
191
+ - **sensitivity**: Sensitivity or true positive rate; identical to recall.
192
+ - **specificity**: The specificity, or true negative rate, measures the proportion of actual negatives that are correctly identified.
193
+ - **tn**: The number of true negatives, indicating how many negative class samples were correctly identified.
+ - **fp**: The number of false positives, indicating how many negative class samples were incorrectly identified as positive.
194
+ - **fn**: The number of false negatives, indicating how many positive class samples were incorrectly identified as negative.
+ - **tp**: The number of true positives, indicating how many positive class samples were correctly identified.
195
+ - **Np**: The number of positive samples evaluated (tp + fn).
+ - **Nn**: The number of negative samples evaluated (tn + fp).
+ - **eval_time**: The time taken to evaluate the model or method, usually in seconds.
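
The derived columns can be sanity-checked from the raw confusion-matrix counts. As a sketch, a small helper (hypothetical, not part of ProkBERT) recomputes the summary metrics for the DeepVirFinder L = 256 row above:

```python
import math

def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the summary metrics from raw confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        'acc': (tp + tn) / total,                 # overall accuracy
        'f1': 2 * tp / (2 * tp + fp + fn),        # harmonic mean of precision/recall
        'mcc': (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        'sensitivity': tp / (tp + fn),            # recall / true positive rate
        'specificity': tn / (tn + fp),            # true negative rate
    }

# DeepVirFinder, L = 256: tn=4542, fp=450, fn=3278, tp=1729
m = classification_metrics(4542, 450, 3278, 1729)
# Values land close to the table's 0.627 / 0.481 / 0.309 for acc / f1 / mcc.
```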
196
+
197
+
198
+
199
 
200
 
201
  ### Ethical Considerations and Limitations
202
 
203
+ As with all models in the bioinformatics domain, ProkBERT-mini-long should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.
204
 
205
  ### Reporting Issues
206
 
 
225
  ISSN={1664-302X},
226
  ABSTRACT={...}
227
  }
228
+ ```