ligeti committed on
Commit 2d41b08 · verified · 1 Parent(s): 16f37ae

Update README.md

Files changed (1)
  1. README.md +86 -7
README.md CHANGED
@@ -1,12 +1,7 @@
- ProkBERT
- First release of ProkBERT
- ---
- license: cc-by-nc-4.0
- ---
 
  ## ProkBERT-mini Model

- ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. This model, optimized for DNA sequence analysis, employs a unique tokenization strategy to effectively capture and interpret complex genomic data.

  ### Model Details
 
@@ -14,6 +9,7 @@ ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, sp

  **Architecture:** ProkBERT-mini-k6s1 is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.

  **Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.

  **Parameters:**
@@ -32,7 +28,85 @@ ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, sp
  - sequence classification tasks
  - Exploration of genomic patterns and features

- **Out-of-scope Uses:** Not intended for use in non-genomic contexts or applications outside the realm of bioinformatics.

  ### Training Data and Process

@@ -62,3 +136,8 @@ Please report any issues with the model or its outputs to the Neural Bioinformat

  - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
  - **Feedback and inquiries:** [[email protected]](mailto:[email protected])

  ## ProkBERT-mini Model

+ ProkBERT-mini-k6s1 is part of the ProkBERT family of genomic language models, specifically designed for microbiome applications. Optimized for DNA sequence analysis, the model provides robust, high-resolution solutions.

  ### Model Details

  **Architecture:** ProkBERT-mini-k6s1 is based on the MegatronBert architecture, a variant of the BERT model optimized for large-scale training. The model employs a learnable relative key-value positional embedding, mapping input vectors into a 384-dimensional space.
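
A minimal loading sketch with the Hugging Face `transformers` API is shown below; the repository ID `neuralbioinfo/prokbert-mini` and the `trust_remote_code=True` flag are assumptions (they are not stated in this card) and should be adjusted to the actual checkpoint name.

```python
# Hedged sketch: load a ProkBERT-mini checkpoint for masked-language modelling.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "neuralbioinfo/prokbert-mini"  # assumed repository ID, adjust as needed
# trust_remote_code is assumed to be required for the custom DNA tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

print(model.config.hidden_size)  # expected: 384, matching the description above
```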

+
  **Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.
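
To make the k6s1 scheme concrete, here is a small pure-Python illustration (not the package's own tokenizer) of how a DNA segment is cut into overlapping 6-mers with a shift of 1:

```python
# Illustrative only: overlapping k-mer tokenization with k=6 and shift=1.
def kmer_tokens(seq: str, k: int = 6, shift: int = 1):
    """Return the k-mers obtained by sliding a window of size k by `shift` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, shift)]

print(kmer_tokens("ATGAAAGCATTA"))
# ['ATGAAA', 'TGAAAG', 'GAAAGC', 'AAAGCA', 'AAGCAT', 'AGCATT', 'GCATTA']
```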
  **Parameters:**
 
  ...

  - sequence classification tasks
  - Exploration of genomic patterns and features

+ ## Segmentation and Tokenization in ProkBERT Models
+
+ ### Preprocessing Sequence Data
+ Transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene but smaller than the average bacterial genome.
+ The initial stage of our pipeline involves two primary steps: segmentation and tokenization.
+
+ #### Segmentation
+ Segmentation is crucial for Genomic Language Models (GLMs) because they process limited-size chunks of sequence data, typically up to about 4 kb. Segmentation divides the sequence into smaller parts and can be either contiguous, splitting the sequence into disjoint segments, or random, sampling segments of length L from random positions.
+
+ The first practical step in segmentation involves loading the sequence from a FASTA file, often including the reverse complement of the sequence.
+
+ **Segmentation process:**
+ ![Segmentation Process](https://github.com/nbrg-ppcu/prokbert/blob/main/assets/Figure2_segmentation.png?raw=true)
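
As a variant of the preprocessing example further below, a contiguous split can presumably be requested through the segmentation parameters; the values here are illustrative, and 'contiguous' is assumed to be the accepted counterpart of the 'random' type used later:

```python
# Hedged sketch: contiguous (disjoint, back-to-back) segmentation parameters.
segmentation_params = {
    'max_length': 512,     # illustrative segment length
    'min_length': 6,
    'type': 'contiguous'   # assumed value; the example below uses 'random'
}
```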
+
+ #### Tokenization Process
+ After segmentation, sequences are encoded into a vector format. The LCA (Local Context-Aware) tokenization method allows the model to use a broader context and to reduce computational demands while maintaining the information-rich local context.
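
A rough way to see this trade-off: for a segment of length L, k-mer size k and shift s, the token count is about (L - k) / s + 1, so a larger shift lets a fixed token budget cover a longer stretch of sequence. A back-of-the-envelope check (plain Python, not ProkBERT code):

```python
# Approximate number of tokens produced for a segment of length L.
def n_tokens(L: int, k: int = 6, shift: int = 1) -> int:
    return (L - k) // shift + 1

for shift in (1, 2):
    print(f"shift={shift}: a 1000 bp segment -> {n_tokens(1000, 6, shift)} tokens")
# shift=1: a 1000 bp segment -> 995 tokens
# shift=2: a 1000 bp segment -> 498 tokens
```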
+
+ Basic Steps for Preprocessing:
+
+ 1. Load FASTA files: Begin by loading the raw sequence data from FASTA files.
+ 2. Segment the raw sequences: Apply segmentation parameters to split the sequences into manageable segments.
+ 3. Tokenize the segmented database: Use the defined tokenization parameters to convert the segments into tokenized forms.
+ 4. Create a padded/truncated array: Generate a uniform array structure, padding or truncating as necessary.
+ 5. Save the array to HDF: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
+
+ ```python
+ import pkg_resources
+ from os.path import join
+ from prokbert.sequtils import *
+
+ # Directory for pretraining FASTA files
+ pretraining_fasta_files_dir = pkg_resources.resource_filename('prokbert', 'data/pretraining')
+
+ # Define segmentation and tokenization parameters
+ segmentation_params = {
+     'max_length': 256,  # Split the sequence into segments of length L
+     'min_length': 6,
+     'type': 'random'
+ }
+ tokenization_parameters = {
+     'kmer': 6,
+     'shift': 1,
+     'max_segment_length': 2003,
+     'token_limit': 2000
+ }
+
+ # Setup configuration
+ defconfig = SeqConfig()
+ segmentation_params = defconfig.get_and_set_segmentation_parameters(segmentation_params)
+ tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)
+
+ # Load and segment sequences
+ input_fasta_files = [join(pretraining_fasta_files_dir, file) for file in get_non_empty_files(pretraining_fasta_files_dir)]
+ sequences = load_contigs(input_fasta_files, IsAddHeader=True, adding_reverse_complement=True, AsDataFrame=True, to_uppercase=True, is_add_sequence_id=True)
+ segment_db = segment_sequences(sequences, segmentation_params, AsDataFrame=True)
+
+ # Tokenization
+ tokenized = batch_tokenize_segments_with_ids(segment_db, tokenization_params)
+ expected_max_token = max(len(arr) for arrays in tokenized.values() for arr in arrays)
+ X, torchdb = get_rectangular_array_from_tokenized_dataset(tokenized, tokenization_params['shift'], expected_max_token)
+
+ # Save to HDF file
+ hdf_file = '/tmp/pretraining.h5'
+ save_to_hdf(X, hdf_file, database=torchdb, compression=True)
+ ```
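
After the preprocessing above, a quick sanity check on the outputs can be useful before training. This assumes, as the example suggests, that `X` is a rectangular integer array of token IDs and `torchdb` is a per-segment bookkeeping table:

```python
# Hedged sanity check on the preprocessing outputs (variable names from the example above).
print(X.shape)       # (number of segments, expected_max_token)
print(len(torchdb))  # assumed: one row per tokenized segment
print(X[0][:10])     # first few token IDs of the first segment
```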
+
+ ### Installation of ProkBERT (if needed)
+
+ To set up ProkBERT in your environment, install it with the following command (if it is not already installed):
+
+ ```python
+ try:
+     import prokbert
+     print("ProkBERT is already installed.")
+ except ImportError:
+     # Note: the "!" prefix is notebook (IPython/Colab) syntax; in a plain shell, run `pip install prokbert` instead.
+     !pip install prokbert
+     print("Installed ProkBERT.")
+ ```
  ### Training Data and Process

  ...

  - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
  - **Feedback and inquiries:** [[email protected]](mailto:[email protected])
+
+ ProkBERT
+ First release of ProkBERT
+ ---
+ license: cc-by-nc-4.0