neuralbioinfo/PhaStyle-mini

PyTorch

Safetensors

prokbert

custom_code


ligeti committed on Oct 16, 2024

Commit 3b7ca88 (verified) · 1 parent: 99802f9

Update README.md


Files changed (1)

README.md +29 -4


@@ -4,18 +4,17 @@ license: cc-by-nc-sa-4.0
 # ProkBERT PhaStyle
 **Model Name**: neuralbioinfo/PhaStyle-mini
-**Model Type**: Genomic Language Model (BERT-based)
+**Model Type**: Genomic Language Model (ProkBERT-based)
 **Model Description**:
-ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on BERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences, allowing it to generalize to phages beyond the *E. coli* domain.
+ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on ProkBERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences
 By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
 ### Key Points:
 - **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
 - **Segment Length** for training: 512 base pairs.
 - **Output**: Binary classification (virulent or temperate).
-- **Model Parameters**: ~21-26 million parameters depending on the variant used.
+- **Model Parameters**: ~25 million parameters.
 ---
@@ -44,3 +43,29 @@ python bin/PhaStyle.py \
     --per_device_eval_batch_size 196
 ```
+### Datasets Used:
+- **BACPHLIP (without E. coli)**: 1,868 training sequences and 246 validation sequences.
+- **Guelin Collection**: 394 *Escherichia* phages (temperate and virulent types).
+- **EXTREMOPHILE Phages**: 16 phages isolated from extreme environments, including deep-sea, acidic, and arsenic-rich habitats.
+Each dataset was processed using **512bp segment lengths** to simulate fragmented metagenomic assemblies.
+---
+## Performance
+ProkBERT PhaStyle outperforms state-of-the-art models, especially in generalization and speed. It has been benchmarked on **short fragments** (512bp) and **phages from unseen environments**, demonstrating its robustness for both environmental and clinical datasets.
+### Key Metrics:
+- **Balanced Accuracy**: 0.94 (on 1022bp fragments from the *Escherichia* dataset)
+- **MCC (Matthews Correlation Coefficient)**: 0.91
+- **Sensitivity**: 0.97
+- **Specificity**: 0.91
+---
+## Limitations
+ProkBERT PhaStyle is specifically designed for **binary classification** of phage lifestyles (virulent vs. temperate) and does not handle non-phage sequences. It is recommended to use this model in conjunction with upstream pipelines that identify phage sequences. For large-scale inference, **GPU support** is strongly advised.
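
Two ideas in the README text added by this commit can be sketched in Python: cutting a genome into 512 bp segments, and deriving the four reported metric types from a binary confusion matrix. This is a minimal illustration, not the PhaStyle implementation. The function names `segment_sequence` and `binary_metrics` are hypothetical. The windowing scheme (contiguous, non-overlapping, short tail dropped) and the choice of which lifestyle is treated as the positive class are assumptions, since the README specifies neither. The counts in the usage lines are made-up toy values, not results from the model.

```python
from math import sqrt


def segment_sequence(seq: str, length: int = 512, keep_tail: bool = False) -> list[str]:
    """Split a nucleotide string into contiguous, non-overlapping windows.

    Assumption: the README only states a 512 bp segment length; the
    non-overlapping scheme and tail handling here are illustrative choices.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    segments = [seq[i:i + length] for i in range(0, len(seq), length)]
    # Drop a trailing fragment shorter than `length` unless asked to keep it.
    if segments and not keep_tail and len(segments[-1]) < length:
        segments.pop()
    return segments


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Compute sensitivity, specificity, balanced accuracy, and MCC.

    Assumption: one lifestyle (e.g. temperate) is treated as the positive
    class; the README does not say which one it uses.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Toy usage with made-up inputs (not PhaStyle data or results).
toy_genome = "ACGT" * 300  # 1,200 bp: two full 512 bp windows, 176 bp tail dropped
print(len(segment_sequence(toy_genome)))
print(binary_metrics(tp=8, fn=2, tn=9, fp=1))
```

Because MCC depends on class balance while sensitivity and specificity do not, the same sensitivity/specificity pair can correspond to different MCC values on differently balanced test sets.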