---
license: cc-by-nc-4.0
tags:
- prokbert
- bioinformatics
- genomics
- sequence embedding
- genomic language models
- nucleotide
- dna-sequence
- promoter-prediction
- phage
---
## ProkBERT-mini-long Model
ProkBERT-mini-long (also called prokbert-mini-k6s2) is part of the ProkBERT family of genomic language models, designed specifically for microbiome applications. It supports an extended context of up to 4 kb of sequence by using LCA (local context-aware) tokenization with a k-mer size of 6 and a shift of 2. Its performance is comparable to that of the other models in the family.
## Simple Usage Example
The following example demonstrates how to use the ProkBERT-mini-long model for processing a DNA sequence:
```python
from transformers import MegatronBertForMaskedLM
from prokbert.prokbert_tokenizer import ProkBERTTokenizer
# Tokenization parameters
tokenization_parameters = {
    'kmer': 6,
    'shift': 2
}
# Initialize the tokenizer and model
tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters, operation_space='sequence')
model = MegatronBertForMaskedLM.from_pretrained("neuralbioinfo/prokbert-mini-long")
# Example DNA sequence
sequence = 'ATGTCCGCGGGACCT'
# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")
# Ensure that inputs have a batch dimension
inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
# Generate outputs from the model
outputs = model(**inputs)
```
### Model Details
**Developed by:** Neural Bioinformatics Research Group
**Architecture:** ProkBERT-mini-long is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.
**Tokenizer:** The model uses a 6-mer tokenizer with a shift of 2 (k6s2), specifically designed to handle DNA sequences efficiently.
**Parameters:**
| Parameter | Description |
|----------------------|--------------------------------------|
| Model Size | 26.6 million parameters |
| Max. Context Size | 4096 bp |
| Training Data | 206.65 billion nucleotides |
| Layers | 6 |
| Attention Heads | 6 |
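As a rough sanity check on the context size above, the number of k-mer windows that cover a sequence follows directly from the k-mer size and the shift. The helper below is an illustrative sketch (`kmer_token_count` is a hypothetical name, not part of the `prokbert` package), and it ignores any special tokens the real tokenizer may add.

```python
def kmer_token_count(seq_len: int, kmer: int = 6, shift: int = 2) -> int:
    """Number of k-mer windows that fit into seq_len bases, starting at offset 0."""
    if seq_len < kmer:
        return 0
    return (seq_len - kmer) // shift + 1


print(kmer_token_count(4096))  # 2046 windows for a 4096 bp segment
print(kmer_token_count(15))    # 5 windows for a 15 bp sequence
```

With a shift of 2, a 4096 bp segment therefore needs roughly half as many tokens as it has bases.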
### Intended Use
**Intended Use Cases:** ProkBERT-mini-long is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:
- Sequence classification tasks
- Exploration of genomic patterns and features
## Segmentation and Tokenization in ProkBERT Models
### Preprocessing Sequence Data
Transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene but smaller than the average bacterial genome.
The initial stage of our pipeline involves two primary steps: segmentation and tokenization.
For more details about tokenization, please see the following notebook: [Tokenization Notebook in Google Colab](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Tokenization.ipynb).
For more details about segmentation, please see the following notebook: [Segmentation Notebook in Google Colab](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Segmentation.ipynb).
#### Segmentation
Segmentation is crucial for Genomic Language Models (GLMs) as they process limited-size chunks of sequence data, typically ranging from 0 to 4kb. The sequence is divided into smaller parts through segmentation, which can be either contiguous, splitting the sequence into disjoint segments, or random, involving randomly sampling segments of length L.
The first practical step in segmentation involves loading the sequence from a FASTA file, often including the reverse complement of the sequence.
**Segmentation process:**
![Segmentation Process](https://github.com/nbrg-ppcu/prokbert/blob/main/assets/Figure2_segmentation.png?raw=true)
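The two segmentation modes described above, plus the reverse complement, can be sketched in plain Python. This is a minimal illustration under simple assumptions (uppercase or lowercase ACGT input, sequence already loaded as a string); the function names are hypothetical and are not the `prokbert` API, which is shown in the segmentation notebook.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def contiguous_segments(seq: str, length: int) -> list[str]:
    """Split seq into disjoint segments; the last one may be shorter."""
    return [seq[i:i + length] for i in range(0, len(seq), length)]


def random_segments(seq: str, length: int, n: int, seed: int = 0) -> list[str]:
    """Sample n segments of exactly `length` bases at random start positions."""
    if len(seq) < length:
        return []
    rng = random.Random(seed)
    starts = [rng.randint(0, len(seq) - length) for _ in range(n)]
    return [seq[s:s + length] for s in starts]


seq = "ATGTCCGCGGGACCT"
print(contiguous_segments(seq, 6))  # ['ATGTCC', 'GCGGGA', 'CCT']
print(reverse_complement(seq))      # AGGTCCCGCGGACAT
print(random_segments(seq, 6, n=3))
```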
#### Tokenization Process
After segmentation, sequences are encoded into a vector format. The LCA method allows the model to use a broader context and reduce computational demands while maintaining the information-rich local context.
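To make the k-mer and shift parameters concrete, the sketch below splits a sequence into overlapping 6-mers with a shift of 2. It only illustrates the windowing idea; `kmer_tokenize` and its `offset` argument are hypothetical, and the real `ProkBERTTokenizer` additionally maps k-mers to vocabulary ids and handles special tokens.

```python
def kmer_tokenize(seq: str, kmer: int = 6, shift: int = 2, offset: int = 0) -> list[str]:
    """Return overlapping k-mers, starting at `offset` and moving `shift` bases per step."""
    return [seq[i:i + kmer] for i in range(offset, len(seq) - kmer + 1, shift)]


sequence = "ATGTCCGCGGGACCT"
print(kmer_tokenize(sequence))
# ['ATGTCC', 'GTCCGC', 'CCGCGG', 'GCGGGA', 'GGGACC']
print(kmer_tokenize(sequence, offset=1))
# ['TGTCCG', 'TCCGCG', 'CGCGGG', 'CGGGAC', 'GGACCT']
```

Neighbouring tokens share four bases here, which is how local context is kept while the token count is roughly halved relative to a shift of 1.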
## Basic Steps for Preprocessing:
1. **Load Fasta Files**: Begin by loading the raw sequence data from FASTA files.
2. **Segment the Raw Sequences**: Apply segmentation parameters to split the sequences into manageable segments.
3. **Tokenize the Segmented Database**: Use the defined tokenization parameters to convert the segments into tokenized forms.
4. **Create a Padded/Truncated Array**: Generate a uniform array structure, padding or truncating as necessary.
5. **Save the Array to HDF**: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
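Step 4 above can be sketched without any dependencies. The example assumes the segments have already been converted to integer token ids and that `0` is the padding id; both are assumptions made for illustration, and the actual ids come from the tokenizer vocabulary. Saving to HDF (step 5) is left out because it needs an external library.

```python
def pad_or_truncate(token_ids: list[list[int]], max_len: int, pad_id: int = 0) -> list[list[int]]:
    """Return rows of exactly max_len ids: long rows are cut, short rows are right-padded."""
    rows = []
    for ids in token_ids:
        row = list(ids[:max_len])                 # truncate if too long
        row += [pad_id] * (max_len - len(row))    # pad if too short
        rows.append(row)
    return rows


batch = [[5, 9, 3], [7, 2, 8, 4, 6, 1], [4]]
print(pad_or_truncate(batch, max_len=4))
# [[5, 9, 3, 0], [7, 2, 8, 4], [4, 0, 0, 0]]
```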
### Installation of ProkBERT (if needed)
For setting up ProkBERT in your environment, you can install it using the following command (if not already installed):
```python
# Notebook-only: the "!pip" line is IPython/Jupyter syntax, not plain Python.
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert
    print("Installed ProkBERT.")
```
### Training Data and Process
**Overview:** The model was pretrained on a comprehensive dataset of genomic sequences to ensure broad coverage and robust learning.
**Training Process:**
- **Masked Language Modeling (MLM):** The MLM objective was modified for genomic sequences for masking overlapping k-mers.
- **Training Phases:** The model was first trained with complete sequence restoration and selective masking, then in a second phase on variable-length datasets for increased complexity.
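Why overlapping k-mers need special care during masking can be seen from a small sketch: with k = 6 and shift = 2, each base is covered by up to three tokens, so masking a single token leaves its bases visible in the neighbours. The code below is an illustration of that overlap only, not the authors' masking procedure; the function names and the `[MASK]` string are assumptions.

```python
def tokens_covering(pos: int, n_tokens: int, kmer: int = 6, shift: int = 2) -> list[int]:
    """Indices of tokens whose window [i*shift, i*shift + kmer) contains base `pos`."""
    return [i for i in range(n_tokens) if i * shift <= pos < i * shift + kmer]


def mask_base(tokens: list[str], pos: int, kmer: int = 6, shift: int = 2,
              mask_token: str = "[MASK]") -> list[str]:
    """Mask every token that overlaps base `pos`, so the base is fully hidden."""
    hidden = set(tokens_covering(pos, len(tokens), kmer, shift))
    return [mask_token if i in hidden else t for i, t in enumerate(tokens)]


tokens = ["ATGTCC", "GTCCGC", "CCGCGG", "GCGGGA", "GGGACC"]
print(tokens_covering(6, len(tokens)))  # [1, 2, 3]
print(mask_base(tokens, 6))
# ['ATGTCC', '[MASK]', '[MASK]', '[MASK]', 'GGGACC']
```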
### Evaluation Results for ProkBERT-mini-long
| Model | L | Avg. Ref. Rank | Avg. Top1 | Avg. Top3 | Avg. AUC |
|------------------------|----:|---------------:|----------:|----------:|---------:|
| `ProkBERT-mini-long` | 128 | 3.9432 | 0.2164 | 0.4781 | 0.9991 |
| `ProkBERT-mini-long` | 256 | 3.5072 | 0.2470 | 0.5258 | 0.9992 |
| `ProkBERT-mini-long` | 512 | 3.3026 | 0.2669 | 0.5435 | 0.9992 |
| `ProkBERT-mini-long` |1024 | 3.2082 | 0.2768 | 0.5589 | 0.9992 |
*Masking performance of the ProkBERT family.*
### Evaluation of Promoter Prediction Tools on the E. coli Sigma70 Dataset
| Tool | Accuracy | MCC | Sensitivity | Specificity |
|-----------------------|----------|-------|-------------|-------------|
| ProkBERT-mini | **0.87** | **0.74** | 0.90 | 0.85 |
| ProkBERT-mini-c | **0.87** | 0.73 | 0.88 | 0.85 |
| ProkBERT-mini-long | **0.87** | **0.74** | 0.89 | 0.85 |
| CNNProm | 0.72 | 0.50 | 0.95 | 0.51 |
| iPro70-FMWin | 0.76 | 0.53 | 0.84 | 0.69 |
| 70ProPred | 0.74 | 0.51 | 0.90 | 0.60 |
| iPromoter-2L | 0.64 | 0.37 | 0.94 | 0.37 |
| Multiply | 0.50 | 0.05 | 0.81 | 0.23 |
| bTSSfinder | 0.46 | -0.07 | 0.48 | 0.45 |
| BPROM | 0.56 | 0.10 | 0.20 | 0.87 |
| IBPP | 0.50 | -0.03 | 0.26 | 0.71 |
| Promotech | 0.71 | 0.43 | 0.49 | **0.90** |
| Sigma70Pred | 0.66 | 0.42 | 0.95 | 0.41 |
| iPromoter-BnCNN | 0.55 | 0.27 | **0.99** | 0.18 |
| MULTiPly | 0.54 | 0.19 | 0.92 | 0.22 |
*The ProkBERT family models perform consistently across the assessed metrics; all three reach an accuracy of 0.87 in the table above.*
| Metric | ProkBERT-mini | ProkBERT-mini-c | ProkBERT-mini-long | Promotech | Sigma70Pred | iPromoter-BnCNN | MULTiPly |
|--------------|---------------|-----------------|--------------------|-----------|-------------|-----------------|----------|
| Accuracy | 0.81 | 0.79 | 0.81 | 0.61 | 0.62 | 0.61 | 0.58 |
| F1 | 0.81 | 0.78 | 0.81 | 0.43 | 0.58 | 0.65 | 0.58 |
| MCC | 0.63 | 0.57 | 0.62 | 0.29 | 0.24 | 0.21 | 0.16 |
| Sensitivity | 0.81 | 0.75 | 0.79 | 0.29 | 0.52 | 0.66 | 0.57 |
| Specificity | 0.82 | 0.82 | 0.83 | 0.93 | 0.71 | 0.55 | 0.59 |
*Promoter prediction performance metrics on a diverse test set. A comparative analysis of various promoter prediction tools, showcasing their performance across key metrics including accuracy, F1 score, MCC, sensitivity, and specificity.*
### Evaluation on phage recognition benchmark
| method | L | auc_class1 | acc | f1 | mcc | recall | sensitivity | specificity | tn | fp | fn | tp | Np | Nn | eval_time |
|:--------------|-----:|-------------:|---------:|---------:|---------:|---------:|--------------:|--------------:|-----:|-----:|-----:|-----:|------:|------:|------------:|
| DeepVirFinder | 256 | 0.734914 | 0.627163 | 0.481213 | 0.309049 | 0.345317 | 0.345317 | 0.909856 | 4542 | 450 | 3278 | 1729 | 5007 | 4992 | 7580 |
| DeepVirFinder | 512 | 0.791423 | 0.708 | 0.637717 | 0.443065 | 0.521192 | 0.521192 | 0.889722 | 4510 | 559 | 2361 | 2570 | 4931 | 5069 | 2637 |
| DeepVirFinder | 1024 | 0.826255 | 0.7424 | 0.702678 | 0.505333 | 0.605651 | 0.605651 | 0.880579 | 4380 | 594 | 1982 | 3044 | 5026 | 4974 | 1294 |
| DeepVirFinder | 2048 | 0.853098 | 0.7717 | 0.743339 | 0.557177 | 0.6612 | 0.6612 | 0.8822 | 4411 | 589 | 1694 | 3306 | 5000 | 5000 | 1351 |
| INHERIT | 256 | 0.75982 | 0.6943 | 0.67012 | 0.393179 | 0.620008 | 0.620008 | 0.76883 | 3838 | 1154 | 1903 | 3105 | 5008 | 4992 | 2131 |
| INHERIT | 512 | 0.816326 | 0.7228 | 0.651408 | 0.479323 | 0.525248 | 0.525248 | 0.914973 | 4638 | 431 | 2341 | 2590 | 4931 | 5069 | 2920 |
| INHERIT | 1024 | 0.846547 | 0.7264 | 0.659447 | 0.495935 | 0.527059 | 0.527059 | 0.927825 | 4615 | 359 | 2377 | 2649 | 5026 | 4974 | 3055 |
| INHERIT | 2048 | 0.864122 | 0.7365 | 0.668595 | 0.518541 | 0.5316 | 0.5316 | 0.9414 | 4707 | 293 | 2342 | 2658 | 5000 | 5000 | 3225 |
| MINI | 256 | 0.846745 | 0.7755 | 0.766462 | 0.552855 | 0.735623 | 0.735623 | 0.815505 | 4071 | 921 | 1324 | 3684 | 5008 | 4992 | 6.68888 |
| MINI | 512 | 0.924973 | 0.8657 | 0.859121 | 0.732696 | 0.83046 | 0.83046 | 0.89998 | 4562 | 507 | 836 | 4095 | 4931 | 5069 | 16.3681 |
| MINI | 1024 | 0.956432 | 0.9138 | 0.911189 | 0.829645 | 0.879825 | 0.879825 | 0.94813 | 4716 | 258 | 604 | 4422 | 5026 | 4974 | 51.3319 |
| MINI-C | 256 | 0.827635 | 0.7512 | 0.7207 | 0.51538 | 0.640974 | 0.640974 | 0.861779 | 4302 | 690 | 1798 | 3210 | 5008 | 4992 | 7.33697 |
| MINI-C | 512 | 0.913378 | 0.8466 | 0.834876 | 0.69725 | 0.786453 | 0.786453 | 0.905109 | 4588 | 481 | 1053 | 3878 | 4931 | 5069 | 17.6749 |
| MINI-C | 1024 | 0.94644 | 0.8937 | 0.891564 | 0.788427 | 0.869479 | 0.869479 | 0.918175 | 4567 | 407 | 656 | 4370 | 5026 | 4974 | 54.204 |
| MINI-LONG | 256 | 0.777697 | 0.71495 | 0.686224 | 0.437727 | 0.622404 | 0.622404 | 0.807792 | 8065 | 1919 | 3782 | 6234 | 10016 | 9984 | 6.10304 |
| MINI-LONG | 512 | 0.880831 | 0.81405 | 0.798001 | 0.632855 | 0.744879 | 0.744879 | 0.881338 | 8935 | 1203 | 2516 | 7346 | 9862 | 10138 | 12.1307 |
| MINI-LONG | 1024 | 0.9413 | 0.88925 | 0.884917 | 0.781465 | 0.847195 | 0.847195 | 0.931745 | 9269 | 679 | 1536 | 8516 | 10052 | 9948 | 30.5088 |
| MINI-LONG | 2048 | 0.964551 | 0.929 | 0.927455 | 0.85878 | 0.9077 | 0.9077 | 0.9503 | 9503 | 497 | 923 | 9077 | 10000 | 10000 | 94.404 |
| Virsorter2 | 512 | 0.620782 | 0.6259 | 0.394954 | 0.364831 | 0.247617 | 0.247617 | 0.993884 | 5038 | 31 | 3710 | 1221 | 4931 | 5069 | 2057 |
| Virsorter2 | 1024 | 0.719898 | 0.7178 | 0.621919 | 0.51036 | 0.461799 | 0.461799 | 0.976478 | 4857 | 117 | 2705 | 2321 | 5026 | 4974 | 3258 |
| Virsorter2 | 2048 | 0.816142 | 0.8103 | 0.778724 | 0.647532 | 0.6676 | 0.6676 | 0.953 | 4765 | 235 | 1662 | 3338 | 5000 | 5000 | 5737 |
### Column Descriptions
- **method**: The algorithm or method used for prediction (e.g., DeepVirFinder, INHERIT).
- **L**: Length of the genomic segment.
- **auc_class1**: Area under the ROC curve for class 1, indicating the model's ability to distinguish between classes.
- **acc**: Accuracy of the prediction, representing the proportion of true results (both true positives and true negatives) among the total number of cases examined.
- **f1**: The F1 score, a measure of a test's accuracy that considers both the precision and the recall.
- **mcc**: Matthews correlation coefficient, a quality measure for binary (two-class) classifications.
- **recall**: The recall, or true positive rate, measures the proportion of actual positives that are correctly identified.
- **sensitivity**: Sensitivity or true positive rate; identical to recall.
- **specificity**: The specificity, or true negative rate, measures the proportion of actual negatives that are correctly identified.
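- **tn**: The number of true negatives, indicating how many negative class samples were correctly identified as negative.
- **fp**: The number of false positives, indicating how many negative class samples were incorrectly identified as positive.
- **fn**: The number of false negatives, indicating how many positive class samples were incorrectly identified as negative.
- **tp**: The number of true positives, indicating how many positive class samples were correctly identified.
- **Np**: The number of positive class samples in the test set (tp + fn).
- **Nn**: The number of negative class samples in the test set (tn + fp).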
- **eval_time**: The time taken to evaluate the model or method, usually in seconds.
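The metric columns can be recomputed from the confusion-matrix counts in the same row. The sketch below uses the standard definitions; `binary_metrics` is a hypothetical helper, and the input values are taken from the MINI-LONG row with L = 2048.

```python
import math


def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    n_pos = tp + fn  # Np in the table
    n_neg = tn + fp  # Nn in the table
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / (n_pos + n_neg),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
    }


# Counts from the MINI-LONG, L = 2048 row of the table above.
metrics = binary_metrics(tn=9503, fp=497, fn=923, tp=9077)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The printed values should agree with the `acc`, `f1`, `mcc`, `sensitivity`, and `specificity` columns of that row up to rounding.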
### Ethical Considerations and Limitations
As with all models in the bioinformatics domain, ProkBERT-mini-long should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.
### Reporting Issues
Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group through the following means:
- **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
- **Feedback and inquiries:** [[email protected]](mailto:[email protected])
## Reference
If you use ProkBERT-mini in your research, please cite the following paper:
```
@ARTICLE{10.3389/fmicb.2023.1331233,
AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
TITLE={ProkBERT family: genomic language models for microbiome applications},
JOURNAL={Frontiers in Microbiology},
VOLUME={14},
YEAR={2024},
URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
DOI={10.3389/fmicb.2023.1331233},
ISSN={1664-302X},
ABSTRACT={...}
}
```