Update README.md
README.md
CHANGED
@@ -4,18 +4,17 @@ license: cc-by-nc-sa-4.0
|
|
4 |
# ProkBERT PhaStyle
|
5 |
|
6 |
**Model Name**: neuralbioinfo/PhaStyle-mini
|
7 |
-
**Model Type**: Genomic Language Model (
|
8 |
**Model Description**:
|
9 |
|
10 |
-
ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on
|
11 |
-
|
12 |
By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
|
13 |
|
14 |
### Key Points:
|
15 |
- **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
|
16 |
- **Segment Length** for training: 512 base pairs.
|
17 |
- **Output**: Binary classification (virulent or temperate).
|
18 |
-
- **Model Parameters**: ~
|
19 |
|
20 |
---
|
21 |
|
@@ -44,3 +43,29 @@ python bin/PhaStyle.py \
|
|
44 |
--per_device_eval_batch_size 196
|
45 |
|
46 |
```
4 |
# ProkBERT PhaStyle
|
5 |
|
6 |
**Model Name**: neuralbioinfo/PhaStyle-mini
|
7 |
+
**Model Type**: Genomic Language Model (ProkBERT-based)
|
8 |
**Model Description**:
|
9 |
|
10 |
+
ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on the ProkBERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences.
|
|
|
11 |
By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
|
12 |
|
13 |
### Key Points:
|
14 |
- **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
|
15 |
- **Segment Length** for training: 512 base pairs.
|
16 |
- **Output**: Binary classification (virulent or temperate).
|
17 |
+
- **Model Parameters**: ~25 million parameters.
|
18 |
|
19 |
---
|
20 |
|
|
|
43 |
--per_device_eval_batch_size 196
|
44 |
|
45 |
```
|
46 |
+
|
47 |
+
### Datasets Used:
|
48 |
+
|
49 |
+
- **BACPHLIP (without *E. coli*)**: 1,868 training sequences and 246 validation sequences.
|
50 |
+
- **Guelin Collection**: 394 *Escherichia* phages (temperate and virulent types).
|
51 |
+
- **EXTREMOPHILE Phages**: 16 phages isolated from extreme environments, including deep-sea, acidic, and arsenic-rich habitats.
|
52 |
+
|
53 |
+
Each dataset was processed using **512bp segment lengths** to simulate fragmented metagenomic assemblies.
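
To make the segmentation step concrete, here is a minimal sketch of cutting a phage genome into 512 bp pieces. It assumes contiguous, non-overlapping windows and drops a shorter trailing piece; the README does not state the exact strategy, so the real PhaStyle preprocessing may differ (for example, by sampling segments at random positions). The function name `segment_sequence` and its parameters are hypothetical and not part of the ProkBERT code base.

```python
from typing import List


def segment_sequence(sequence: str, segment_length: int = 512,
                     drop_short_tail: bool = True) -> List[str]:
    """Split a nucleotide sequence into contiguous, non-overlapping segments.

    Assumption: fixed-size windows with no overlap. This is an illustration,
    not the segmentation routine used to build the PhaStyle datasets.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be a positive integer")

    sequence = sequence.upper()
    segments = [
        sequence[start:start + segment_length]
        for start in range(0, len(sequence), segment_length)
    ]

    # Optionally discard the final piece if it is shorter than a full segment.
    if drop_short_tail and segments and len(segments[-1]) < segment_length:
        segments.pop()
    return segments


if __name__ == "__main__":
    toy_genome = "ACGT" * 300  # 1,200 bp toy sequence, not real data
    pieces = segment_sequence(toy_genome)
    print(len(pieces), [len(p) for p in pieces])
```

Dropping the short tail keeps every segment at the same length; pass `drop_short_tail=False` to keep the remainder instead.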
|
54 |
+
|
55 |
+
---
|
56 |
+
|
57 |
+
## Performance
|
58 |
+
|
59 |
+
ProkBERT PhaStyle outperforms state-of-the-art models, especially in generalization and speed. It has been benchmarked on **short fragments** (512bp) and **phages from unseen environments**, demonstrating its robustness for both environmental and clinical datasets.
|
60 |
+
|
61 |
+
### Key Metrics:
|
62 |
+
- **Balanced Accuracy**: 0.94 (on 1022bp fragments from the *Escherichia* dataset)
|
63 |
+
- **MCC (Matthews Correlation Coefficient)**: 0.91
|
64 |
+
- **Sensitivity**: 0.97
|
65 |
+
- **Specificity**: 0.91
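
For readers who want to compute the same metrics on their own predictions, the sketch below derives them from binary confusion-matrix counts using the standard definitions. Balanced accuracy is the mean of sensitivity and specificity, which is consistent with the values listed above: (0.97 + 0.91) / 2 = 0.94. Which lifestyle is treated as the positive class is not stated in this README, so that choice is left to the caller. The function name `binary_metrics` is hypothetical, and the counts in the example are illustrative, not benchmark data.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, balanced accuracy and MCC.

    tp/fn count the positive class, tn/fp the negative class. Assigning
    'temperate' or 'virulent' to the positive class is the caller's choice.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2

    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    for name, value in binary_metrics(tp=90, tn=80, fp=20, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Unlike balanced accuracy, MCC depends on the class proportions in the evaluation set, so it cannot be recovered from sensitivity and specificity alone.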
|
66 |
+
|
67 |
+
---
|
68 |
+
|
69 |
+
## Limitations
|
70 |
+
|
71 |
+
ProkBERT PhaStyle is specifically designed for **binary classification** of phage lifestyles (virulent vs. temperate) and does not handle non-phage sequences. It is recommended to use this model in conjunction with upstream pipelines that identify phage sequences. For large-scale inference, **GPU support** is strongly advised.
|