PyTorch
megatron-bert
ligeti committed on
Commit 99802f9 · verified · 1 Parent(s): a97eefb

Update README.md

Files changed (1)
  1. README.md +46 -3
README.md CHANGED
@@ -1,3 +1,46 @@
- ---
- license: cc-by-nc-sa-4.0
- ---
+ ---
+ license: cc-by-nc-sa-4.0
+ ---
+ # ProkBERT PhaStyle
+
+ **Model Name**: neuralbioinfo/PhaStyle-mini
+ **Model Type**: Genomic Language Model (BERT-based)
+ **Model Description**:
+
+ ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on the BERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences, allowing it to generalize to phages beyond the *E. coli* domain.
+
+ By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
+
+ ### Key Points:
+ - **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
+ - **Segment Length** for training: 512 base pairs.
+ - **Output**: Binary classification (virulent or temperate).
+ - **Model Parameters**: ~21-26 million parameters depending on the variant used.
+
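The 512 bp training segment length listed in the Key Points suggests that longer contigs are cut into windows before they are classified. The following is a minimal sketch of that idea, assuming consecutive, non-overlapping windows. The `segment_sequence` helper and its `min_length` parameter are hypothetical and are not part of the ProkBERT API; the segmentation actually used by the PhaStyle pipeline is not shown in this README.

```python
from typing import List


def segment_sequence(seq: str, segment_length: int = 512, min_length: int = 1) -> List[str]:
    """Split a nucleotide sequence into consecutive, non-overlapping windows.

    Hypothetical helper for illustration only; not part of the ProkBERT API.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    seq = seq.strip().upper()
    # Cut the sequence at every multiple of segment_length.
    segments = [seq[i:i + segment_length] for i in range(0, len(seq), segment_length)]
    # Drop windows shorter than min_length (only the last window can be short).
    return [s for s in segments if len(s) >= min_length]


# A 1200 bp toy sequence is cut into two full windows plus a shorter remainder.
toy_segments = segment_sequence("ACGT" * 300)
print([len(s) for s in toy_segments])
```

Raising `min_length` discards a short trailing fragment, which may be useful if very short pieces are considered too uninformative to classify.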
+ ---
+
+ ## Intended Use
+
+ ProkBERT PhaStyle is designed for phage lifestyle prediction tasks, suitable for:
+
+ - **Phage Therapy**: Identifying virulent phages for bacterial infection treatment.
+ - **Microbiome Engineering**: Understanding the interaction between temperate and virulent phages in various microbiomes.
+ - **Metagenomic Studies**: Classifying fragmented phage sequences from environmental or clinical samples.
+
+ ### Inference Code
+
+ ProkBERT PhaStyle requires the **ProkBERT tokenizer** and a **custom classification model** (`BertForBinaryClassificationWithPooling`). Below is a high-level overview of how to use the model in inference mode:
+
+ ```python
+ aaa
+ ```
+
+ ```bash
+ python bin/PhaStyle.py \
+ --fastain data/EXTREMOPHILE/extremophiles.fasta \
+ --out output_predictions.tsv \
+ --ftmodel neuralbioinfo/PhaStyle-mini \
+ --modelclass BertForBinaryClassificationWithPooling \
+ --per_device_eval_batch_size 196
+
+ ```
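The shell command added in this commit can also be assembled from Python, which may be convenient when running the script over several FASTA files. The sketch below only builds the argument list from the flags that appear in the command above. The `build_phastyle_command` wrapper is hypothetical, and it assumes the working directory is a checkout that contains `bin/PhaStyle.py`.

```python
import shlex
import sys
from typing import List


def build_phastyle_command(
    fasta_in: str,
    out_path: str,
    ft_model: str = "neuralbioinfo/PhaStyle-mini",
    model_class: str = "BertForBinaryClassificationWithPooling",
    batch_size: int = 196,
    script: str = "bin/PhaStyle.py",
) -> List[str]:
    """Assemble the argument list for the PhaStyle script (hypothetical wrapper).

    Only flags that appear in the README command are used.
    """
    return [
        sys.executable, script,
        "--fastain", fasta_in,
        "--out", out_path,
        "--ftmodel", ft_model,
        "--modelclass", model_class,
        "--per_device_eval_batch_size", str(batch_size),
    ]


cmd = build_phastyle_command(
    "data/EXTREMOPHILE/extremophiles.fasta", "output_predictions.tsv"
)
# Print the command in shell-quoted form for inspection.
print(shlex.join(cmd))
# To execute it from a checkout that contains bin/PhaStyle.py:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a FASTA path contains spaces.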