---
license: mit
pipeline_tag: mask-generation
tags:
- biology
- metagenomics
- bigbird
---


**Model Overview:**
This model builds on the BigBird architecture, following the approach detailed in our paper "Leveraging Large Language Models for Metagenomic Analysis." It is optimized for long gene sequence data: trained specifically on gene sequences, it aims to uncover valuable insights in metagenomic data and is evaluated across various tasks, including classification and sequence embedding.

**Model Architecture:**
- **Base Model:** BigBird transformer architecture
- **Tokenizer:** Custom k-mer tokenizer with a k-mer length of 6 and overlapping tokens (see the sketch after this list)
- **Training:** Trained on a diverse dataset of 497 housekeeping genes from 2,000 bacterial and archaeal genomes
- **Embeddings:** Generates sequence embeddings using mean pooling of hidden states
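
To make the tokenization step concrete, the snippet below is a minimal sketch of what overlapping 6-mer splitting looks like. It only illustrates how a sequence is broken into k-mers; the actual vocabulary and token IDs are produced by the KmerTokenizer package used in the example further down.

```python
# Minimal sketch of overlapping k-mer splitting (illustration only;
# the real vocabulary and token IDs come from the KmerTokenizer package).
def split_into_kmers(seq: str, k: int = 6, overlapping: bool = True) -> list[str]:
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

print(split_into_kmers("ATCGATCGAT", k=6))
# ['ATCGAT', 'TCGATC', 'CGATCG', 'GATCGA', 'ATCGAT']
```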

**Dataset:**
Scorpio Gene-Taxa Benchmark Dataset:

https://zenodo.org/records/12964684

https://huggingface.co/datasets/MsAlEhR/scorpio-gene-taxa
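
If you want to pull the benchmark programmatically, a sketch using the `datasets` library is shown below; it assumes the dataset can be loaded directly by its Hub repo ID, so check the dataset card for the exact configuration and split names.

```python
from datasets import load_dataset

# Assumption: the benchmark loads directly by repo ID; see the dataset card
# for any required configuration or split names.
dataset = load_dataset("MsAlEhR/scorpio-gene-taxa")
print(dataset)
```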

**Steps to Use the Model:**

1. **Install KmerTokenizer:**

   ```sh
   pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
   ```
2. **Example Code:**
   ```python
   from KmerTokenizer import KmerTokenizer
   from transformers import AutoModel
   import torch

   # Example gene sequence
   seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"

   # Initialize the tokenizer (overlapping 6-mers, max length 4096)
   tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=4096)
   tokenized_output = tokenizer.kmer_tokenize(seq)
   pad_token_id = 2  # Pad token ID used by the tokenizer

   # Create attention mask (1 for real tokens, 0 for padding)
   attention_mask = torch.tensor(
       [1 if token != pad_token_id else 0 for token in tokenized_output],
       dtype=torch.long,
   ).unsqueeze(0)

   # Convert tokenized output to a LongTensor and add a batch dimension
   inputs = torch.tensor([tokenized_output], dtype=torch.long)

   # Load the pre-trained BigBird model
   model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)

   # Generate hidden states
   outputs = model(input_ids=inputs, attention_mask=attention_mask)

   # Get embeddings from the last hidden state
   embeddings = outputs.hidden_states[-1]

   # Expand the attention mask to match the embedding dimensions
   expanded_attention_mask = attention_mask.unsqueeze(-1)

   # Compute mean-pooled sequence embeddings over non-padding tokens
   mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
   ```
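
The mean-pooled embeddings can be compared directly, for example with cosine similarity. The helper below is a hypothetical sketch (not part of the released package) that wraps the pooling steps above into an `embed(seq)` function; it reuses the `tokenizer`, `model`, and `pad_token_id` objects defined in the example.

```python
import torch.nn.functional as F

def embed(seq):
    # Hypothetical helper: repeats the masking and mean-pooling steps above.
    tokens = tokenizer.kmer_tokenize(seq)
    input_ids = torch.tensor([tokens], dtype=torch.long)
    mask = (input_ids != pad_token_id).long()
    with torch.no_grad():
        hidden = model(input_ids=input_ids, attention_mask=mask).hidden_states[-1]
    mask = mask.unsqueeze(-1)
    return torch.sum(mask * hidden, dim=1) / torch.sum(mask, dim=1)

# Compare two toy gene sequences by cosine similarity of their embeddings.
emb_a = embed("ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC")
emb_b = embed("ATGCGTACGTTAGCATCGATCGATCGATCGTACGATCGA")
print(f"Cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")
```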

**Citation:**
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers:
> Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*.
>
> Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2025). Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization. *Communications Biology*.