ligeti committed on
Commit 2d41b08 · verified · 1 Parent(s): 16f37ae

Update README.md

Files changed (1)
  1. README.md +86 -7
README.md CHANGED
@@ -1,12 +1,7 @@
- ProkBERT
- First release of ProkBERT
- ---
- license: cc-by-nc-4.0
- ---
 
  ## ProkBERT-mini Model

- ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. This model, optimized for DNA sequence analysis, employs a unique tokenization strategy to effectively capture and interpret complex genomic data.

  ### Model Details
 
@@ -14,6 +9,7 @@ ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, sp

  **Architecture:** ProkBERT-mini-k6s1 is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.

  **Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.

  **Parameters:**
@@ -32,7 +28,85 @@ ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, sp
  - sequence classification tasks
  - Exploration of genomic patterns and features

- **Out-of-scope Uses:** Not intended for use in non-genomic contexts or applications outside the realm of bioinformatics.

  ### Training Data and Process

@@ -62,3 +136,8 @@ Please report any issues with the model or its outputs to the Neural Bioinformat

  - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
  - **Feedback and inquiries:** [[email protected]](mailto:[email protected])

  ## ProkBERT-mini Model

+ ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. Optimized for DNA sequence analysis, the model provides robust, high-resolution solutions.

  ### Model Details

  **Architecture:** ProkBERT-mini-k6s1 is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.
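
A minimal loading sketch with the Hugging Face `transformers` API is shown below; the repository ID `neuralbioinfo/prokbert-mini` and the `trust_remote_code=True` flag are assumptions (they are not stated in this card) and should be adjusted to the actual checkpoint name.

```python
# Hedged sketch: load a ProkBERT-mini checkpoint for masked-language modelling.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "neuralbioinfo/prokbert-mini"  # assumed repository ID, adjust as needed
# trust_remote_code is assumed to be required for the custom DNA tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

print(model.config.hidden_size)  # expected: 384, matching the description above
```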

+
  **Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.
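
To make the k6s1 scheme concrete, here is a small pure-Python illustration (not the package's own tokenizer) of how a DNA segment is cut into overlapping 6-mers with a shift of 1:

```python
# Illustrative only: overlapping k-mer tokenization with k=6 and shift=1.
def kmer_tokens(seq: str, k: int = 6, shift: int = 1):
    """Return the k-mers obtained by sliding a window of size k by `shift` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, shift)]

print(kmer_tokens("ATGAAAGCATTA"))
# ['ATGAAA', 'TGAAAG', 'GAAAGC', 'AAAGCA', 'AAGCAT', 'AGCATT', 'GCATTA']
```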
  **Parameters:**
 
  ...

  - sequence classification tasks
  - Exploration of genomic patterns and features

+ ## Segmentation and Tokenization in ProkBERT Models
+
+ ### Preprocessing Sequence Data
+ Transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene but smaller than the average bacterial genome.
+ The initial stage of our pipeline involves two primary steps: segmentation and tokenization.
+
+ #### Segmentation
+ Segmentation is crucial for Genomic Language Models (GLMs) because they process limited-size chunks of sequence data, typically up to about 4 kb. Segmentation divides the sequence into smaller parts and can be either contiguous, splitting the sequence into disjoint segments, or random, sampling segments of length L from random positions.
+
+ The first practical step in segmentation involves loading the sequence from a FASTA file, often including the reverse complement of the sequence.
+
+ **Segmentation process:**
+ ![Segmentation Process](https://github.com/nbrg-ppcu/prokbert/blob/main/assets/Figure2_segmentation.png?raw=true)
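
As a variant of the preprocessing example further below, a contiguous split can presumably be requested through the segmentation parameters; the values here are illustrative, and 'contiguous' is assumed to be the accepted counterpart of the 'random' type used later:

```python
# Hedged sketch: contiguous (disjoint, back-to-back) segmentation parameters.
segmentation_params = {
    'max_length': 512,     # illustrative segment length
    'min_length': 6,
    'type': 'contiguous'   # assumed value; the example below uses 'random'
}
```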
+
+ #### Tokenization Process
+ After segmentation, sequences are encoded into a vector format. The LCA (Local Context-Aware) tokenization method allows the model to use a broader context and to reduce computational demands while maintaining the information-rich local context.
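
A rough way to see this trade-off: for a segment of length L, k-mer size k and shift s, the token count is about (L - k) / s + 1, so a larger shift lets a fixed token budget cover a longer stretch of sequence. A back-of-the-envelope check (plain Python, not ProkBERT code):

```python
# Approximate number of tokens produced for a segment of length L.
def n_tokens(L: int, k: int = 6, shift: int = 1) -> int:
    return (L - k) // shift + 1

for shift in (1, 2):
    print(f"shift={shift}: a 1000 bp segment -> {n_tokens(1000, 6, shift)} tokens")
# shift=1: a 1000 bp segment -> 995 tokens
# shift=2: a 1000 bp segment -> 498 tokens
```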
+
+ Basic Steps for Preprocessing:
+
+ 1. Load FASTA files: Begin by loading the raw sequence data from FASTA files.
+ 2. Segment the raw sequences: Apply segmentation parameters to split the sequences into manageable segments.
+ 3. Tokenize the segmented database: Use the defined tokenization parameters to convert the segments into tokenized forms.
+ 4. Create a padded/truncated array: Generate a uniform array structure, padding or truncating as necessary.
+ 5. Save the array to HDF: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
+
+ ```python
+ import pkg_resources
+ from os.path import join
+ from prokbert.sequtils import *
+
+ # Directory for pretraining FASTA files
+ pretraining_fasta_files_dir = pkg_resources.resource_filename('prokbert', 'data/pretraining')
+
+ # Define segmentation and tokenization parameters
+ segmentation_params = {
+     'max_length': 256,  # Split the sequence into segments of length L
+     'min_length': 6,
+     'type': 'random'
+ }
+ tokenization_parameters = {
+     'kmer': 6,
+     'shift': 1,
+     'max_segment_length': 2003,
+     'token_limit': 2000
+ }
+
+ # Setup configuration
+ defconfig = SeqConfig()
+ segmentation_params = defconfig.get_and_set_segmentation_parameters(segmentation_params)
+ tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)
+
+ # Load and segment sequences
+ input_fasta_files = [join(pretraining_fasta_files_dir, file) for file in get_non_empty_files(pretraining_fasta_files_dir)]
+ sequences = load_contigs(input_fasta_files, IsAddHeader=True, adding_reverse_complement=True, AsDataFrame=True, to_uppercase=True, is_add_sequence_id=True)
+ segment_db = segment_sequences(sequences, segmentation_params, AsDataFrame=True)
+
+ # Tokenization
+ tokenized = batch_tokenize_segments_with_ids(segment_db, tokenization_params)
+ expected_max_token = max(len(arr) for arrays in tokenized.values() for arr in arrays)
+ X, torchdb = get_rectangular_array_from_tokenized_dataset(tokenized, tokenization_params['shift'], expected_max_token)
+
+ # Save to HDF file
+ hdf_file = '/tmp/pretraining.h5'
+ save_to_hdf(X, hdf_file, database=torchdb, compression=True)
+ ```
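
After the preprocessing above, a quick sanity check on the outputs can be useful before training. This assumes, as the example suggests, that `X` is a rectangular integer array of token IDs and `torchdb` is a per-segment bookkeeping table:

```python
# Hedged sanity check on the preprocessing outputs (variable names from the example above).
print(X.shape)       # (number of segments, expected_max_token)
print(len(torchdb))  # assumed: one row per tokenized segment
print(X[0][:10])     # first few token IDs of the first segment
```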
+
+ ### Installation of ProkBERT (if needed)
+
+ To set up ProkBERT in your environment, install it with the following command (if it is not already installed):
+
+ ```python
+ try:
+     import prokbert
+     print("ProkBERT is already installed.")
+ except ImportError:
+     # Note: the "!" prefix is notebook (IPython/Colab) syntax; in a plain shell, run `pip install prokbert` instead.
+     !pip install prokbert
+     print("Installed ProkBERT.")
+ ```
  ### Training Data and Process

  ...

  - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
  - **Feedback and inquiries:** [[email protected]](mailto:[email protected])
+
+ ProkBERT
+ First release of ProkBERT
+ ---
+ license: cc-by-nc-4.0