Update README.md
README.md
CHANGED
## ProkBERT-mini Model
ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. The model is optimized for DNA sequence analysis and provides robust, high-resolution solutions.
### Preprocessing Sequence Data

Transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene but smaller than the average bacterial genome.
The initial stage of our pipeline involves two primary steps: segmentation and tokenization.
For more details about tokenization, please see the following notebook: [Tokenization Notebook in Google Colab](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Tokenization.ipynb).

For more details about segmentation, please see the following notebook: [Segmentation Notebook in Google Colab](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Segmentation.ipynb).
#### Segmentation

Segmentation is crucial for Genomic Language Models (GLMs) because they process limited-size chunks of sequence data, typically up to 4 kb. Segmentation divides the sequence into smaller parts and can be either contiguous, splitting the sequence into disjoint segments, or random, sampling segments of length L from random positions.
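A minimal sketch of the two strategies in plain Python (illustrative only; the function names and the 512 bp segment length are assumptions, not the ProkBERT package's own segmentation utilities):

```python
import random

def contiguous_segments(sequence: str, segment_len: int = 512) -> list:
    """Split a sequence into disjoint, back-to-back segments."""
    return [sequence[i:i + segment_len] for i in range(0, len(sequence), segment_len)]

def random_segments(sequence: str, segment_len: int = 512, n_segments: int = 10) -> list:
    """Sample n_segments windows of length L = segment_len at random start positions."""
    max_start = len(sequence) - segment_len
    starts = (random.randint(0, max_start) for _ in range(n_segments))
    return [sequence[s:s + segment_len] for s in starts]

sequence = "ATGC" * 1000                         # toy 4 kb sequence
print(len(contiguous_segments(sequence)))        # 8 disjoint chunks (the last one is shorter)
print(len(random_segments(sequence)))            # 10 randomly placed, possibly overlapping chunks
```

Contiguous splitting guarantees that every base is covered exactly once, while random sampling produces a more varied set of training segments.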
#### Tokenization Process

After segmentation, sequences are encoded into a vector format. The LCA (Local Context-Aware) method allows the model to use a broader context and reduce computational demands while maintaining the information-rich local context.
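As a rough illustration of what this looks like for the mini-k6s1 variant (k = 6, shift = 1), the sketch below maps a segment to overlapping 6-mer ids using a toy vocabulary; it is a simplified stand-in, not the actual ProkBERT tokenizer:

```python
from itertools import product

K, SHIFT = 6, 1  # k6s1: 6-mers advanced one base at a time

# Toy vocabulary: a few special tokens plus every possible 6-mer over A/C/G/T.
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + ["".join(p) for p in product("ACGT", repeat=K)])}

def lca_tokenize(segment: str, k: int = K, shift: int = SHIFT) -> list:
    """Slide a k-wide window over the segment in steps of `shift` and
    map each overlapping k-mer to its vocabulary id."""
    kmers = [segment[i:i + k] for i in range(0, len(segment) - k + 1, shift)]
    return [VOCAB.get(kmer, VOCAB["[UNK]"]) for kmer in kmers]

print(lca_tokenize("ATGAAAGTC"))  # four overlapping 6-mers: ATGAAA, TGAAAG, GAAAGT, AAAGTC
```

Because consecutive tokens overlap by k - shift = 5 bases, the local context of each position is encoded redundantly across neighbouring tokens, which keeps the representation information-rich at single-base resolution.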
#### Basic Steps for Preprocessing

1. **Load FASTA Files**: Begin by loading the raw sequence data from FASTA files.
2. **Segment the Raw Sequences**: Apply segmentation parameters to split the sequences into manageable segments.
3. **Tokenize the Segmented Database**: Use the defined tokenization parameters to convert the segments into tokenized forms.
4. **Create a Padded/Truncated Array**: Generate a uniform array structure, padding or truncating as necessary.
5. **Save the Array to HDF**: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
```python
import pkg_resources
```
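Continuing with steps 4 and 5 from the list above, here is a minimal sketch of padding/truncating tokenized segments into a uniform array and saving it to HDF5; the `numpy`/`h5py` calls and the dataset name are illustrative assumptions rather than the exact layout produced by the ProkBERT tooling:

```python
import numpy as np
import h5py

def pad_or_truncate(token_ids, max_len=512, pad_id=0):
    """Step 4: build a uniform (n_segments, max_len) array from ragged token lists."""
    arr = np.full((len(token_ids), max_len), pad_id, dtype=np.int32)
    for row, ids in enumerate(token_ids):
        ids = ids[:max_len]            # truncate overly long segments
        arr[row, :len(ids)] = ids      # left-align, pad the rest with pad_id
    return arr

# Toy output of step 3: tokenized segments of different lengths.
tokenized_segments = [[5, 17, 42, 8], [9, 11], [3, 3, 3, 3, 3, 3]]
token_array = pad_or_truncate(tokenized_segments, max_len=8)

# Step 5: store the array in an HDF5 file for efficient retrieval during training.
with h5py.File("pretraining_segments.hdf5", "w") as f:
    f.create_dataset("token_ids", data=token_array, compression="gzip")
```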
**Overview:** The model was pretrained on a comprehensive dataset of genomic sequences to ensure broad coverage and robust learning.

**Training Process:**
- **Masked Language Modeling (MLM):** The MLM objective was adapted to genomic sequences by masking overlapping k-mers (a minimal sketch follows this list).
- **Training Phases:** The model was first trained with complete sequence restoration and selective masking, followed by a second phase using variable-length datasets for increased complexity.
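A minimal sketch of the idea behind masking overlapping k-mers: once a nucleotide position is selected, every token whose 6-mer covers that base is replaced by the mask token, so the masked base cannot be read off a neighbouring, overlapping token (plain Python, illustrative only; the real masking schedule follows the ProkBERT training code):

```python
K = 6  # with shift = 1, token i covers bases i .. i + K - 1

def mask_overlapping_kmers(token_ids, masked_base, mask_id):
    """Replace every k-mer token that overlaps the selected base with the mask id."""
    masked = list(token_ids)
    first = max(0, masked_base - K + 1)            # first token still covering the base
    last = min(len(token_ids) - 1, masked_base)    # last token starting at or before the base
    for i in range(first, last + 1):
        masked[i] = mask_id
    return masked

tokens = list(range(100, 120))                     # 20 consecutive overlapping 6-mer ids
print(mask_overlapping_kmers(tokens, masked_base=10, mask_id=4))
# tokens 5..10 are masked, because each of their 6-mers contains base 10
```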
### Evaluation Results
- **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
- **Feedback and inquiries:** [[email protected]](mailto:[email protected])

---
license: cc-by-nc-4.0