---
license: cc-by-nc-4.0
tags:
- prokbert
- bioinformatics
- genomics
- sequence embedding
- genomic language models
- nucleotide
- dna-sequence
- promoter
- microbiome
---
## ProkBERT-mini-c-promoter Model

This finetuned model is specifically designed for promoter identification and is based on the [ProkBERT-mini-long model](https://huggingface.co/neuralbioinfo/prokbert-mini-long). 

For more details, refer to the [promoter dataset description](https://huggingface.co/datasets/neuralbioinfo/bacterial_promoters) used for training and evaluating this model.

### Example Usage

For practical examples on how to use this model, see the following Jupyter notebooks:

- [Training Notebook](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Finetuning.ipynb): A guide to fine-tuning the ProkBERT-mini model for promoter identification tasks.
- [Evaluation Notebook](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Inference.ipynb): Demonstrates how to evaluate the finetuned ProkBERT-mini-promoter model on test datasets.

### Model Application

The model was trained for binary classification to distinguish between promoter and non-promoter sequences. The length and composition of the promoter sequences were standardized to ensure compatibility with alternative methods and to facilitate direct comparison of model performance.



## Simple Usage Example

The following example demonstrates how to use the ProkBERT-mini-c-promoter model for processing a DNA sequence:

```python
from prokbert.prokbert_tokenizer import ProkBERTTokenizer
from prokbert.models import BertForBinaryClassificationWithPooling

# Fine-tuned promoter classifier on the Hugging Face Hub
finetuned_model = "neuralbioinfo/prokbert-mini-c-promoter"
# Tokenization parameters matching the k1s1 tokenizer described under Model Details
kmer = 1
shift = 1

tok_params = {"kmer": kmer, "shift": shift}
tokenizer = ProkBERTTokenizer(tokenization_params=tok_params)
model = BertForBinaryClassificationWithPooling.from_pretrained(finetuned_model)

sequence = 'TAGCGCATAATGATTTCCTTATAAGCGATCGCTCTGAAAGCGTTCTACGATAATAATGATATCCTTTCAATAATAGCGTAT'
inputs = tokenizer(sequence, return_tensors="pt")
# Ensure that inputs have a batch dimension
inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
# Generate outputs from the model
outputs = model(**inputs)
print(outputs)

``` 
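
The structure of the printed `outputs` object is not described further here. As a minimal sketch, assuming the classifier yields two raw scores (logits) per sequence, these can be turned into class probabilities with a softmax. The logit values below are made up, and treating index 1 as the promoter class is an assumption that should be checked against the model's label mapping.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sequence: [non-promoter, promoter] (assumed order).
example_logits = [-1.2, 2.3]
probabilities = softmax(example_logits)
# Index of the most probable class.
predicted_class = probabilities.index(max(probabilities))

print(probabilities, predicted_class)
```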

### Model Details

**Developed by:** Neural Bioinformatics Research Group

**Architecture:** 

Conventional ...SequenceClassification models classify a sequence from the hidden representation of the [CLS] (starting) token. This model instead adds a pooling layer on top of the base model that integrates information across all nucleotides in the sequence.
The model expects 80 bp inputs, the same length as the sequences in the dataset.
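
The difference between the two strategies can be illustrated with a toy example. The sketch below is not the model's actual pooling layer, whose exact form is not specified here. It assumes a plain mean over non-padding token vectors and uses made-up hidden states to contrast that with taking only the first token's vector.

```python
def first_token_representation(hidden_states):
    """Return the vector of the first ([CLS]-like) token only."""
    return list(hidden_states[0])


def mean_pool(hidden_states, attention_mask):
    """Average token vectors per dimension, ignoring padded positions (mask == 0)."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy hidden states: 4 token positions, 2 dimensions; the last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(first_token_representation(hidden))  # uses a single position
print(mean_pool(hidden, mask))             # uses every unmasked position
```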

**Tokenizer:** The model uses a 1-mer tokenizer with a shift of 1 (k1s1).
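
To make the k-mer and shift notation concrete, the sketch below splits a sequence into k-mers with a given shift. `kmer_windows` is a hypothetical helper written for illustration and is not part of the `prokbert` package; the real `ProkBERTTokenizer` also handles vocabulary lookup and any special tokens.

```python
def kmer_windows(sequence, k=1, shift=1):
    """Split a DNA sequence into k-mers, moving `shift` bases between windows."""
    if k < 1 or shift < 1:
        raise ValueError("k and shift must be positive integers")
    sequence = sequence.upper()
    # Only full-length windows are kept; a trailing partial window is dropped.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, shift)]


# k1s1: every nucleotide becomes its own token.
print(kmer_windows("TAGCGC", k=1, shift=1))
# A larger k-mer with shift 2, shown only for contrast.
print(kmer_windows("TAGCGCATAA", k=6, shift=2))
```

With k = 1 and shift = 1 the token count equals the sequence length, so an 80 bp input yields 80 nucleotide tokens before any special tokens are added.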

**Parameters:**

| Parameter            | Description                          |
|----------------------|--------------------------------------|
| Model Size           | 26.6 million parameters              |
| Max. Context Size    | 4096 bp                              |
| Training Data        | 206.65 billion nucleotides           |
| Layers               | 6                                    |
| Attention Heads      | 6                                    |

### Intended Use

**Intended Use Cases:** As with all models in the bioinformatics domain, ProkBERT-mini-c-promoter should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.



### Installation of ProkBERT (if needed)

To set up ProkBERT, run the following cell in a Jupyter or Colab notebook. The `!pip` line is notebook syntax; in a regular shell, run `pip install prokbert` instead.

```python
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert
    print("Installed ProkBERT.")
```

### Training Data and Process

**Overview:** The model was pretrained on a comprehensive dataset of genomic sequences to ensure broad coverage and robust learning.


*Masking performance of the ProkBERT family.*

### Evaluation of Promoter Prediction Tools on the *E. coli* Sigma70 Dataset

| Tool                  | Accuracy | MCC   | Sensitivity | Specificity |
|-----------------------|----------|-------|-------------|-------------|
| ProkBERT-mini         | **0.87** | **0.74** | 0.90        | 0.85        |
| ProkBERT-mini-c       | **0.87** | 0.73  | 0.88        | 0.85        |
| ProkBERT-mini-long    | **0.87** | **0.74** | 0.89        | 0.85        |
| CNNProm               | 0.72     | 0.50  | 0.95        | 0.51        |
| iPro70-FMWin          | 0.76     | 0.53  | 0.84        | 0.69        |
| 70ProPred             | 0.74     | 0.51  | 0.90        | 0.60        |
| iPromoter-2L          | 0.64     | 0.37  | 0.94        | 0.37        |
| Multiply              | 0.50     | 0.05  | 0.81        | 0.23        |
| bTSSfinder            | 0.46     | -0.07 | 0.48        | 0.45        |
| BPROM                 | 0.56     | 0.10  | 0.20        | 0.87        |
| IBPP                  | 0.50     | -0.03 | 0.26        | 0.71        |
| Promotech             | 0.71     | 0.43  | 0.49        | **0.90**    |
| Sigma70Pred           | 0.66     | 0.42  | 0.95        | 0.41        |
| iPromoter-BnCNN       | 0.55     | 0.27  | **0.99**    | 0.18        |
| MULTiPly              | 0.54     | 0.19  | 0.92        | 0.22        |

*The ProkBERT family models perform consistently across the assessed metrics. All three reach an accuracy of 0.87.*

| Metric       | ProkBERT-mini | ProkBERT-mini-c | ProkBERT-mini-long | Promotech | Sigma70Pred | iPromoter-BnCNN | MULTiPly |
|--------------|---------------|-----------------|--------------------|-----------|-------------|-----------------|----------|
| Accuracy     | 0.81          | 0.79            | 0.81               | 0.61      | 0.62        | 0.61            | 0.58     |
| F1           | 0.81          | 0.78            | 0.81               | 0.43      | 0.58        | 0.65            | 0.58     |
| MCC          | 0.63          | 0.57            | 0.62               | 0.29      | 0.24        | 0.21            | 0.16     |
| Sensitivity  | 0.81          | 0.75            | 0.79               | 0.29      | 0.52        | 0.66            | 0.57     |
| Specificity  | 0.82          | 0.82            | 0.83               | 0.93      | 0.71        | 0.55            | 0.59     |

*Promoter prediction performance metrics on a diverse test set. A comparative analysis of various promoter prediction tools, showcasing their performance across key metrics including accuracy, F1 score, MCC, sensitivity, and specificity.*
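
For reference, the metrics reported in the tables above have standard definitions in terms of the confusion matrix. The sketch below computes them from counts of true and false positives and negatives. The counts in the example are invented for illustration and are not taken from the evaluation.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute binary classification metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction
    was made, so none of the ratios below divides by zero.
    """
    sensitivity = tp / (tp + fn)   # fraction of promoters recovered
    specificity = tn / (tn + fp)   # fraction of non-promoters rejected
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Invented counts: 100 promoters and 100 non-promoters.
metrics = classification_metrics(tp=90, fp=15, tn=85, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```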


### Ethical Considerations and Limitations

As with all models in the bioinformatics domain, ProkBERT-mini-c-promoter should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.

### Reporting Issues

Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group through the following means:

- **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
- **Feedback and inquiries:** [[email protected]](mailto:[email protected])

## Reference
If you use ProkBERT-mini in your research, please cite the following paper:


```
@ARTICLE{10.3389/fmicb.2023.1331233,
    AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
    TITLE={ProkBERT family: genomic language models for microbiome applications},
    JOURNAL={Frontiers in Microbiology},
    VOLUME={14},
    YEAR={2024},
    URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
    DOI={10.3389/fmicb.2023.1331233},
    ISSN={1664-302X},
    ABSTRACT={...}
}
```