---
license: cc-by-nc-sa-4.0
---
# ProkBERT PhaStyle

**Model Name**: neuralbioinfo/PhaStyle-mini   
**Model Type**: Genomic Language Model (ProkBERT-based)  
**Model Description**:  

ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on the ProkBERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences.
By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.

### Key Points:
- **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
- **Segment Length** for training: 512 base pairs.
- **Output**: Binary classification (virulent or temperate).
- **Model Parameters**: ~25 million parameters.

---

## Intended Use

ProkBERT PhaStyle is designed for phage lifestyle prediction tasks, suitable for:

- **Phage Therapy**: Identifying virulent phages for bacterial infection treatment.
- **Microbiome Engineering**: Understanding the interaction between temperate and virulent phages in various microbiomes.
- **Metagenomic Studies**: Classifying fragmented phage sequences from environmental or clinical samples.

### Inference Code

ProkBERT PhaStyle requires the **ProkBERT tokenizer** and a **custom classification model** (`BertForBinaryClassificationWithPooling`). The command below runs the model in inference mode on a FASTA file and writes the predictions to a TSV file:

```bash
python bin/PhaStyle.py \
    --fastain data/EXTREMOPHILE/extremophiles.fasta \
    --out output_predictions.tsv \
    --ftmodel neuralbioinfo/PhaStyle-mini \
    --modelclass BertForBinaryClassificationWithPooling \
    --per_device_eval_batch_size 196
```


### Datasets Used:

- **BACPHLIP (without *E. coli*)**: 1,868 training sequences and 246 validation sequences.
- **Guelin Collection**: 394 *Escherichia* phages (temperate and virulent types).
- **EXTREMOPHILE Phages**: 16 phages isolated from extreme environments, including deep-sea, acidic, and arsenic-rich habitats.

Each dataset was processed using **512bp segment lengths** to simulate fragmented metagenomic assemblies.
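
The segmentation step can be illustrated with a short sketch. The model card does not say whether the segments are contiguous, overlapping, or randomly sampled, so the version below assumes non-overlapping contiguous windows; the function name `segment_sequence` and its parameters are hypothetical and not part of the ProkBERT package.

```python
from typing import Iterator, Tuple


def segment_sequence(
    seq: str, segment_length: int = 512, min_length: int = 0
) -> Iterator[Tuple[int, str]]:
    """Yield (start, segment) pairs of contiguous, non-overlapping windows.

    Assumption: non-overlapping windows; the actual pipeline may sample
    segments differently.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    for start in range(0, len(seq), segment_length):
        segment = seq[start:start + segment_length]
        # Optionally drop a short trailing segment.
        if len(segment) >= min_length:
            yield start, segment


# Toy 1,200 bp sequence, not real phage data.
genome = "ACGT" * 300
segments = list(segment_sequence(genome, segment_length=512))
print([(start, len(segment)) for start, segment in segments])
# → [(0, 512), (512, 512), (1024, 176)]
```

Setting `min_length` discards a short final fragment, which may be useful when very short segments carry too little signal to classify.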

---
## Performance Results

The performance of ProkBERT PhaStyle was evaluated on various datasets, including *Escherichia* and EXTREMOPHILE phages, using segment lengths of 512bp and 1022bp. The results are summarized below:

### Performance on *Escherichia* Dataset (512bp and 1022bp segments)

| Method                   | Balanced Accuracy | MCC   | Sensitivity | Specificity |
|--------------------------|-------------------|-------|-------------|-------------|
| **ProkBERT-mini (512bp)** | 0.91              | 0.83  | 0.94        | 0.89        |
| ProkBERT-mini-long (512bp)| 0.90              | 0.82  | 0.96        | 0.85        |
| ProkBERT-mini-c (512bp)   | 0.89              | 0.80  | 0.95        | 0.84        |
| DNABERT-2-117M (512bp)    | 0.84              | 0.72  | 0.95        | 0.74        |
| Nuc. Trans.-50m (512bp)   | 0.85              | 0.72  | 0.92        | 0.78        |
| **ProkBERT-mini (1022bp)**| **0.94**          | **0.88** | **0.97**    | **0.91**    |
| ProkBERT-mini-long (1022bp)| 0.94             | 0.89  | 0.97        | 0.91        |

### Performance on EXTREMOPHILE Dataset (512bp and 1022bp segments)

| Method                   | Balanced Accuracy | MCC   | Sensitivity | Specificity |
|--------------------------|-------------------|-------|-------------|-------------|
| **ProkBERT-mini (512bp)** | 0.93              | 0.83  | 0.99        | 0.87        |
| ProkBERT-mini-long (512bp)| 0.93              | 0.82  | **1.00**    | 0.86        |
| ProkBERT-mini-c (512bp)   | 0.92              | 0.80  | 0.99        | 0.84        |
| DNABERT-2-117M (512bp)    | 0.89              | 0.74  | 0.99        | 0.79        |
| **ProkBERT-mini (1022bp)**| **0.96**          | **0.91** | **1.00**    | **0.93**    |
| ProkBERT-mini-long (1022bp)| 0.96             | 0.90  | 1.00        | 0.92        |

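In both tables, ProkBERT-mini and ProkBERT-mini-long reach a balanced accuracy of 0.90 to 0.96 and an MCC of 0.82 to 0.91, ahead of the DNABERT-2 and Nucleotide Transformer baselines listed. Because *E. coli* sequences were excluded from training, the *Escherichia* results also indicate how well the models generalize. Longer segments help: moving from 512bp to 1022bp raises balanced accuracy by about 0.03 on both datasets.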
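
For readers who want to recompute the four reported metrics on their own predictions, the sketch below derives them from a binary confusion matrix using the standard definitions. Which lifestyle is treated as the positive class is not stated in this model card, so the function only refers to "positive" and "negative"; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the evaluation.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, balanced accuracy, and MCC."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes need at least one example")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Made-up counts for illustration only.
for name, value in binary_metrics(tp=90, fp=20, tn=80, fn=10).items():
    print(f"{name}: {value:.2f}")
```

Balanced accuracy and MCC are both less sensitive to class imbalance than plain accuracy, which matters when virulent and temperate segments are not equally represented.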

For more detailed results, including additional metrics, please refer to the original research paper.

---
## Inference Speed and Running Times

The computational performance of ProkBERT PhaStyle was evaluated using 1,000 randomly selected sequences from the BACPHLIP dataset. The evaluation was performed on a consistent hardware setup with NVIDIA Tesla A100 GPUs. The execution times and inference speeds of various models are summarized below:

### Execution Times and Inference Speeds

| Model                   | Execution Time (seconds) | Inference Speed (MB/sec) |
|--------------------------|--------------------------|--------------------------|
| **ProkBERT-mini-long**    | **132**                  | **0.52**                 |
| ProkBERT-mini             | 141                      | 0.49                     |
| ProkBERT-mini-c           | 146                      | 0.47                     |
| DNABERT-2-117M            | 248                      | 0.25                     |
| Nucleotide Transformer-50m| 342                      | 0.18                     |
| Nucleotide Transformer-500m| 502                     | 0.12                     |
| DeePhage                  | 159                      | 0.43                     |
| PhaTYP                    | 2,718                    | 0.03                     |
| BACPHLIP                  | 7,125                    | 0.01                     |
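
The inference speeds above can be used for a rough running-time estimate on a new dataset. The sketch below assumes that throughput scales linearly with input size and that the hardware is comparable to the benchmark setup, neither of which is guaranteed; the function name `estimated_runtime_seconds` and the 100 MB example size are hypothetical.

```python
def estimated_runtime_seconds(dataset_size_mb: float, speed_mb_per_sec: float) -> float:
    """Estimate running time as size / throughput (assumes linear scaling)."""
    if speed_mb_per_sec <= 0:
        raise ValueError("speed_mb_per_sec must be positive")
    return dataset_size_mb / speed_mb_per_sec


# Throughput values copied from the table above (MB/sec).
speeds = {
    "ProkBERT-mini-long": 0.52,
    "ProkBERT-mini": 0.49,
    "DNABERT-2-117M": 0.25,
}

# Hypothetical 100 MB input; treat the results as order-of-magnitude estimates.
for model, speed in speeds.items():
    seconds = estimated_runtime_seconds(100, speed)
    print(f"{model}: about {seconds:.0f} s for 100 MB")
```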


## Limitations

ProkBERT PhaStyle is specifically designed for **binary classification** of phage lifestyles (virulent vs. temperate) and does not handle non-phage sequences. It is recommended to use this model in conjunction with upstream pipelines that identify phage sequences. For large-scale inference, **GPU support** is strongly advised.

## Citing this work

If you use this model or the data in this package, please cite:

```bibtex
@Article{ProkBERT2024,
  author  = {Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
  journal = {Frontiers in Microbiology},
  title   = {{ProkBERT} family: genomic language models for microbiome applications},
  year    = {2024},
  volume  = {14},
  url     = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
  doi     = {10.3389/fmicb.2023.1331233},
  issn    = {1664-302X}
}
```