PyTorch
megatron-bert
ligeti committed on
Commit
3b7ca88
1 Parent(s): 99802f9

Update README.md

  1. README.md +29 -4
README.md CHANGED
@@ -4,18 +4,17 @@ license: cc-by-nc-sa-4.0
4
  # ProkBERT PhaStyle
5
 
6
  **Model Name**: neuralbioinfo/PhaStyle-mini
7
- **Model Type**: Genomic Language Model (BERT-based)
8
  **Model Description**:
9
 
10
- ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on BERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences, allowing it to generalize to phages beyond the *E. coli* domain.
11
-
12
  By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
13
 
14
  ### Key Points:
15
  - **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
16
  - **Segment Length** for training: 512 base pairs.
17
  - **Output**: Binary classification (virulent or temperate).
18
- - **Model Parameters**: ~21-26 million parameters depending on the variant used.
19
 
20
  ---
21
 
@@ -44,3 +43,29 @@ python bin/PhaStyle.py \
44
  --per_device_eval_batch_size 196
45
 
46
  ```
 
4
  # ProkBERT PhaStyle
5
 
6
  **Model Name**: neuralbioinfo/PhaStyle-mini
7
+ **Model Type**: Genomic Language Model (ProkBERT-based)
8
  **Model Description**:
9
 
10
+ ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on the ProkBERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences.
 
11
  By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
12
 
13
  ### Key Points:
14
  - **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
15
  - **Segment Length** for training: 512 base pairs.
16
  - **Output**: Binary classification (virulent or temperate).
17
+ - **Model Parameters**: ~25 million parameters.
18
 
19
  ---
20
 
 
43
  --per_device_eval_batch_size 196
44
 
45
  ```
46
+
47
+ ### Datasets Used:
48
+
49
+ - **BACPHLIP (without *E. coli*)**: 1,868 training sequences and 246 validation sequences.
50
+ - **Guelin Collection**: 394 *Escherichia* phages (temperate and virulent types).
51
+ - **EXTREMOPHILE Phages**: 16 phages isolated from extreme environments, including deep-sea, acidic, and arsenic-rich habitats.
52
+
53
+ Each dataset was processed using **512bp segment lengths** to simulate fragmented metagenomic assemblies.
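The 512bp segmentation described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the README does not say whether segments overlap or are sampled at random, so the sketch assumes contiguous, non-overlapping windows and drops a trailing fragment shorter than `min_length`. The function name `split_into_segments` and its parameters are hypothetical.

```python
def split_into_segments(sequence, segment_length=512, min_length=512):
    """Cut a nucleotide sequence into consecutive, non-overlapping segments.

    Assumption (not stated in the README): windows are contiguous and a
    trailing fragment shorter than `min_length` is discarded.
    """
    sequence = sequence.upper()
    segments = []
    for start in range(0, len(sequence), segment_length):
        segment = sequence[start:start + segment_length]
        if len(segment) >= min_length:
            segments.append(segment)
    return segments


# Toy 1,200-nt sequence: two full 512-nt segments, the 176-nt tail is dropped.
toy_genome = "ACGT" * 300
toy_segments = split_into_segments(toy_genome)
```

Lowering `min_length` keeps the shorter tail as well, which may be preferable when every base of a short contig should be scored.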
54
+
55
+ ---
56
+
57
+ ## Performance
58
+
59
+ ProkBERT PhaStyle outperforms state-of-the-art models, especially in generalization and speed. It has been benchmarked on **short fragments** (512bp) and **phages from unseen environments**, demonstrating its robustness for both environmental and clinical datasets.
60
+
61
+ ### Key Metrics:
62
+ - **Balanced Accuracy**: 0.94 (on 1022bp fragments from the *Escherichia* dataset)
63
+ - **MCC (Matthews Correlation Coefficient)**: 0.91
64
+ - **Sensitivity**: 0.97
65
+ - **Specificity**: 0.91
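For readers who want to reproduce these metrics on their own predictions, the sketch below computes them from the four confusion-matrix counts using the standard definitions. Which lifestyle is treated as the positive class is not stated in the README, so that choice is left to the caller; the function name `binary_metrics` is hypothetical. Note that balanced accuracy is the mean of sensitivity and specificity, which agrees with the values listed above: (0.97 + 0.91) / 2 = 0.94.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; return 0.0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Illustrative counts only, NOT the model's actual confusion matrix.
example = binary_metrics(tp=97, fp=9, tn=91, fn=3)
```

Unlike balanced accuracy, MCC depends on the class balance of the test set, so it cannot be derived from sensitivity and specificity alone.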
66
+
67
+ ---
68
+
69
+ ## Limitations
70
+
71
+ ProkBERT PhaStyle is specifically designed for **binary classification** of phage lifestyles (virulent vs. temperate) and does not handle non-phage sequences. It is recommended to use this model in conjunction with upstream pipelines that identify phage sequences. For large-scale inference, **GPU support** is strongly advised.