---
license: cc-by-nc-sa-4.0
widget:
- text: ACCTGA<mask>TTCTGAGTC
tags:
- DNA
- biology
- genomics
- segmentation
---
# segment-nt-30kb-multi-species

Segment-NT-30kb-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It is the result of finetuning the [Segment-NT-30kb](https://huggingface.co/InstaDeepAI/segment_nt_30kb) model on a dataset encompassing not only the human genome but also the genomes of 5 other species: mouse, chicken, fly, zebrafish and worm.

For the finetuning on the multi-species genomes, we curated a dataset from a subset of the annotations used to train **Segment-NT-30kb**, mainly because only this subset of annotations is available for these species. The annotations therefore concern the 7 main gene elements available from Ensembl [REF], namely protein-coding gene, 5’UTR, 3’UTR, intron, exon, and splice acceptor and donor sites.


**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models]() TODO: Add link to preprint

### How to use

<!-- Need to adapt this section to our model. Need to figure out how to load the models from huggingface and do inference on them -->
Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models:
```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```

A small snippet of code is given here in order to retrieve both logits and embeddings from a dummy DNA sequence.
```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_30kb_multi_species", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_30kb_multi_species", trust_remote_code=True)


# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
max_length = tokenizer.model_max_length

# Create a dummy DNA sequence and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Compute the logits and embeddings
attention_mask = tokens_ids != tokenizer.pad_token_id
outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Turn the logits into per-class probabilities
logits = outs.logits.detach()
probabilities = torch.nn.functional.softmax(logits, dim=-1)
```
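
As an illustration, the snippet below shows one way to read out the probability track of a single genomic element. It assumes, as in the other Segment-NT checkpoints, that the element names are exposed as `model.config.features` and that the features axis is the second-to-last dimension of the logits; adapt the indexing if your configuration differs.
```python
# Probability track for a single genomic element (here: intron).
# Assumption: `model.config.features` lists the annotated element names in the
# same order as the features axis of the logits.
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[..., idx_intron, :]

# Assumed convention: the last class corresponds to "element present"
p_intron = probabilities_intron[..., -1]
print(p_intron.shape)
```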

## Training data

The **segment-nt-30kb-multi-species** model was finetuned on the human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is held out as a validation set for training monitoring and as a test set for the final evaluation.

## Training procedure

### Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences as 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

```
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```
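
For illustration, the snippet below (reusing the tokenizer loaded in the usage example above) shows how a 30-nucleotide sequence is split into the five 6-mers listed above; the exact output and any surrounding special tokens may vary slightly with the tokenizer version.
```python
# Tokenize a 30-nucleotide sequence into 6-mers (no special tokens added here)
sequence = "ACGTGTACGTGCACGGACGACTAGTCAGCA"
print(tokenizer.tokenize(sequence))
# expected: ['ACGTGT', 'ACGTGC', 'ACGGAC', 'GACTAG', 'TCAGCA']
```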

### Training

The model was finetuned on a DGX H100 node with 8 GPUs on a total of 8B tokens for 3 days.


### Architecture

The model is composed of the [nucleotide-transformer-v2-500m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.
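
A minimal PyTorch sketch of such a 1-D U-Net head is given below for illustration only; it is not the exact implementation, and the kernel sizes, activations, pooling and output shape are assumptions chosen merely to mirror the description above (2 downsampling and 2 upsampling blocks with 1,024 and 2,048 channels).
```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Two 1-D convolutions, as assumed for each U-Net block."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding),
            nn.GELU(),
            nn.Conv1d(out_channels, out_channels, kernel_size, padding=padding),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)


class UNetSegmentationHead(nn.Module):
    """Illustrative 1-D U-Net head: 2 downsampling and 2 upsampling blocks with
    1,024 and 2,048 channels, mapping encoder embeddings to segmentation logits."""

    def __init__(self, embed_dim, num_features):
        super().__init__()
        self.down1 = ConvBlock(embed_dim, 1024)
        self.down2 = ConvBlock(1024, 2048)
        self.pool = nn.MaxPool1d(2)
        self.up1 = ConvBlock(2048, 1024)
        self.up2 = ConvBlock(1024, 1024)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        # Assumed output layout: 2 classes (absent / present) per genomic element
        self.out = nn.Conv1d(1024, num_features * 2, kernel_size=1)

    def forward(self, x):                   # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)               # -> (batch, embed_dim, seq_len)
        x = self.pool(self.down1(x))        # seq_len / 2
        x = self.pool(self.down2(x))        # seq_len / 4
        x = self.upsample(self.up1(x))      # seq_len / 2
        x = self.upsample(self.up2(x))      # seq_len
        return self.out(x).transpose(1, 2)  # (batch, seq_len, num_features * 2)


# Example with dummy encoder embeddings (dimensions are illustrative)
head = UNetSegmentationHead(embed_dim=1280, num_features=7)
dummy_embeddings = torch.randn(2, 64, 1280)   # (batch, tokens, embed_dim)
print(head(dummy_embeddings).shape)           # torch.Size([2, 64, 14])
```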

### BibTeX entry and citation info

<!-- TODO: Add BibTeX citation here -->
```bibtex

```