ligeti committed
Commit f1bb9ca · verified · 1 Parent(s): 2d41b08

Update README.md

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -1,4 +1,3 @@
-
  ## ProkBERT-mini Model
 
  ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. It is optimized for DNA sequence analysis and provides robust, high-resolution solutions.
@@ -33,6 +32,9 @@ ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, sp
  ### Preprocessing Sequence Data
  Transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene but smaller than the average bacterial genome.
  The initial stage of our pipeline involves two primary steps: segmentation and tokenization.
+ For more details about tokenization, please see the following notebook: [Tokenization Notebook in Google Colab](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Tokenization.ipynb).
+
+ For more details about segmentation, please see the following notebook: [Segmentation Notebook in Google Colab](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Segmentation.ipynb).
 
  #### Segmentation
  Segmentation is crucial for Genomic Language Models (GLMs) as they process limited-size chunks of sequence data, typically up to about 4 kb. The sequence is divided into smaller parts through segmentation, which can be either contiguous, splitting the sequence into disjoint segments, or random, where segments of length L are sampled from random positions.
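As a rough illustration of the two strategies described above, a contiguous split and a random sampling of fixed-length segments might look like the sketch below. This is not the prokbert package's own API; the function names and the example sequence are hypothetical, and the package's actual utilities are shown in the Segmentation notebook linked in the README.

```python
import random

def contiguous_segments(sequence: str, segment_length: int):
    """Split the sequence into disjoint, back-to-back segments."""
    return [sequence[i:i + segment_length]
            for i in range(0, len(sequence), segment_length)]

def random_segments(sequence: str, segment_length: int, n_segments: int, seed: int = 0):
    """Sample n_segments windows of length segment_length at random start positions."""
    rng = random.Random(seed)
    max_start = max(len(sequence) - segment_length, 0)
    return [sequence[start:start + segment_length]
            for start in (rng.randint(0, max_start) for _ in range(n_segments))]

sequence = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT"
print(contiguous_segments(sequence, 16))            # disjoint chunks
print(random_segments(sequence, 16, n_segments=3))  # randomly placed windows
```

In practice the segment length is chosen so that the resulting token sequence fits the model's context size.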
@@ -45,13 +47,13 @@ The first practical step in segmentation involves loading the sequence from a FA
  #### Tokenization Process
  After segmentation, sequences are encoded into a vector format. The Local Context-Aware (LCA) tokenization method allows the model to use a broader context and reduce computational demands while maintaining the information-rich local context.
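For intuition only, the sketch below shows overlapping k-mer tokenization with k=6 and a shift of 1, matching the "k6s1" in the model name. The real tokenizer in the prokbert package also handles the vocabulary mapping and special tokens, which are omitted here; `kmer_tokenize` is a simplified stand-in, not the library function.

```python
def kmer_tokenize(segment: str, k: int = 6, shift: int = 1):
    """Slide a window of length k over the segment, advancing `shift` bases per step."""
    return [segment[i:i + k] for i in range(0, len(segment) - k + 1, shift)]

segment = "ATGAAACGCATTAGCACC"
tokens = kmer_tokenize(segment, k=6, shift=1)
print(tokens[:4])  # ['ATGAAA', 'TGAAAC', 'GAAACG', 'AAACGC']
```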
 
- Basic Steps for Preprocessing:
+ ## Basic Steps for Preprocessing:
 
- Load Fasta Files: Begin by loading the raw sequence data from FASTA files.
- Segment the Raw Sequences: Apply segmentation parameters to split the sequences into manageable segments.
- Tokenize the Segmented Database: Use the defined tokenization parameters to convert the segments into tokenized forms.
- Create a Padded/Truncated Array: Generate a uniform array structure, padding or truncating as necessary.
- Save the Array to HDF: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
+ 1. **Load Fasta Files**: Begin by loading the raw sequence data from FASTA files.
+ 2. **Segment the Raw Sequences**: Apply segmentation parameters to split the sequences into manageable segments.
+ 3. **Tokenize the Segmented Database**: Use the defined tokenization parameters to convert the segments into tokenized forms.
+ 4. **Create a Padded/Truncated Array**: Generate a uniform array structure, padding or truncating as necessary.
+ 5. **Save the Array to HDF**: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
 
  ```python
  import pkg_resources
@@ -113,7 +115,7 @@ except ImportError:
  **Overview:** The model was pretrained on a comprehensive dataset of genomic sequences to ensure broad coverage and robust learning.
 
  **Training Process:**
- - **Masked Language Modeling (MLM):** The MLM objective was modified for genomic sequences, involving an intricate strategy for masking overlapping k-mers.
+ - **Masked Language Modeling (MLM):** The MLM objective was modified for genomic sequences to mask overlapping k-mers.
  - **Training Phases:** The model underwent initial training with complete sequence restoration and selective masking, followed by a subsequent phase with variable-length datasets for increased complexity.
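The exact masking procedure belongs to the ProkBERT training code and publication; the toy sketch below only illustrates the issue the MLM bullet refers to. With overlapping k-mers (k=6, shift 1), the neighbours of a masked token reveal most of its bases, so a contiguous block of tokens around the target position has to be hidden together. The block width and function name here are assumptions made purely for illustration.

```python
def mask_overlapping_kmers(tokens, center: int, k: int = 6, mask_token: str = "[MASK]"):
    """Mask every k-mer token that overlaps the k-mer at position `center` (shift = 1)."""
    start = max(center - (k - 1), 0)
    end = min(center + (k - 1), len(tokens) - 1)
    return [mask_token if start <= i <= end else tok for i, tok in enumerate(tokens)]

segment = "ATGAAACGCATTAGCACCACCATT"
tokens = [segment[i:i + 6] for i in range(len(segment) - 5)]  # overlapping 6-mers, shift 1
masked = mask_overlapping_kmers(tokens, center=9, k=6)
print(masked[3:16])  # tokens overlapping position 9 are hidden; unmasked neighbours remain at the edges
```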
118
 
119
  ### Evaluation Results
@@ -137,7 +139,5 @@ Please report any issues with the model or its outputs to the Neural Bioinformat
  - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
  - **Feedback and inquiries:** [[email protected]](mailto:[email protected])
 
- ProkBERT
- First release of ProkBERT
  ---
  license: cc-by-nc-4.0
 
 