[ChatNT](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1) is the first multimodal conversational agent designed with a deep understanding of biological sequences (DNA, RNA, proteins).
It enables users — even those with no coding background — to interact with biological data through natural language and it generalizes across multiple biological tasks and modalities.
**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
### Model Sources
- **Paper:** [ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1.full.pdf)
### Architecture and Parameters

ChatNT is built on a three-module design: a 500M-parameter [Nucleotide Transformer v2](https://www.nature.com/articles/s41592-024-02523-z) DNA encoder pre-trained on genomes from 850 species and handling up to 12 kb per sequence ([Dalla-Torre et al., 2024](https://www.nature.com/articles/s41592-024-02523-z)); an English-aware Perceiver Resampler that linearly projects and, through gated cross-attention, compresses the 2048 DNA-token embeddings into 64 task-conditioned vectors (a Flamingo-style design; Alayrac et al., 2022); and a frozen 7B-parameter [Vicuna-7B](https://lmsys.org/blog/2023-03-30-vicuna/) decoder.

Users provide a natural-language prompt containing one or more `<DNA>` placeholders and the corresponding DNA sequences (tokenized as 6-mers). The projection layer inserts 64 resampled DNA embeddings at each placeholder, and the Vicuna decoder generates free-form English responses autoregressively, using low-temperature sampling to produce classification labels, multi-label statements, or numeric values.

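As an illustrative sketch only (not ChatNT's actual implementation, which adds gating, multiple heads and learned projections), the resampling step can be pictured as cross-attention between 64 latent queries and the DNA-token embeddings; all sizes below are toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
embed_dim = 32                                       # toy embedding size; the real model is far larger
dna_embeddings = rng.normal(size=(2048, embed_dim))  # one embedding per DNA token from the encoder
latent_queries = rng.normal(size=(64, embed_dim))    # 64 learned queries (random here)

# Single-head cross-attention: every latent query attends over all 2048 DNA tokens,
# yielding a fixed-size 64-vector summary regardless of input sequence length.
scores = latent_queries @ dna_embeddings.T / np.sqrt(embed_dim)  # (64, 2048)
weights = softmax(scores)                                        # each row sums to 1
resampled = weights @ dna_embeddings                             # (64, embed_dim)
print(resampled.shape)  # (64, 32)
```

The point of the fixed 64-vector output is that every `<DNA>` placeholder consumes the same number of decoder positions, whatever the DNA input length.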
### Training Data

ChatNT was instruction-tuned on a unified corpus covering 27 diverse tasks from DNA, RNA and proteins, spanning multiple species, tissues and biological processes. This amounted to 605 million DNA tokens (≈ 3.6 billion bases) and 273 million English tokens, sampled uniformly over tasks for 2 billion instruction tokens.

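As a quick back-of-the-envelope check (assuming 6 bases per non-overlapping DNA token; this is an inference, not a figure from the paper), the token and base counts above are consistent:

```python
# 605M DNA tokens at 6 bases per non-overlapping 6-mer token
dna_tokens = 605_000_000
bases = dna_tokens * 6
print(bases)  # 3630000000, i.e. ~3.6 billion bases
```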
### Tokenization

DNA inputs are broken into non-overlapping 6-mer tokens and padded or truncated to 2048 tokens (~12 kb). English prompts and outputs use the LLaMA tokenizer, augmented with `<DNA>` as a special token to mark sequence insertion points.

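For illustration only (in practice the released `bio_tokenizer` should be used, and it handles vocabulary lookup and trailing bases shorter than 6 bp), non-overlapping 6-mer tokenization can be sketched as:

```python
def kmer_tokenize(sequence: str, k: int = 6, max_tokens: int = 2048) -> list:
    """Split a DNA string into non-overlapping k-mers, truncating to max_tokens."""
    kmers = [sequence[i:i + k] for i in range(0, len(sequence), k)]
    return kmers[:max_tokens]

print(kmer_tokenize("ATCGGAAAAAGA"))  # ['ATCGGA', 'AAAAGA']
```

With `max_tokens=2048`, any input longer than 2048 × 6 = 12,288 bp is truncated, which matches the ~12 kb limit above.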
### Credit and License

The DNA encoder is the Nucleotide Transformer v2 ([Dalla-Torre et al., 2024](https://www.nature.com/articles/s41592-024-02523-z)), and the English decoder is Vicuna-7B ([Chiang et al., 2023](https://lmsys.org/blog/2023-03-30-vicuna/)). All code and model artifacts are released under ???.

### Limitations and Disclaimer

While ChatNT excels at conversational molecular-phenotype tasks, it is **not** a clinical or diagnostic tool. It can produce incorrect or “hallucinated” answers, particularly on out-of-distribution inputs, and its numeric predictions may suffer digit-level errors. Confidence estimates require post-hoc calibration. Users should always validate critical outputs against experiments or specialized bioinformatics pipelines.

## How to use
Until its next release, the transformers library needs to be installed from source with the following command in order to use the models.
PyTorch should also be installed.

```
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
```
A small snippet of code is given here in order to **generate ChatNT answers from a pipeline (high-level)**.

- The prompt used for training ChatNT is already incorporated inside the pipeline and is the following:
  "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions."

```
# Load pipeline
# ...

english_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="english_tokenizer")
bio_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="bio_tokenizer")
# Define custom inputs (note that the number of <DNA> tokens in the english sequence must be equal to len(dna_sequences))
# Here the english sequence should include the prompt
english_sequence = "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Is there any evidence of an acceptor splice site in this sequence <DNA> ?"
dna_sequences = ["ATCGGAAAAAGATCCAGAAAGTTATACCAGGCCAATGGGAATCACCTATTACGTGGATAATAGCGATAGTATGTTACCTATAAATTTAACTACGTGGATATCAGGCAGTTACGTTACCAGTCAAGGAGCACCCAAAACTGTCCAGCAACAAGTTAATTTACCCATGAAGATGTACTGCAAGCCTTGCCAACCAGTTAAAGTAGCTACTCATAAGGTAATAAACAGTAATATCGACTTTTTATCCATTTTGATAATTGATTTATAACAGTCTATAACTGATCGCTCTACATAATCTCTATCAGATTACTATTGACACAAACAGAAACCCCGTTAATTTGTATGATATATTTCCCGGTAAGCTTCGATTTTTAATCCTATCGTGACAATTTGGAATGTAACTTATTTCGTATAGGATAAACTAATTTACACGTTTGAATTCCTAGAATATGGAGAATCTAAAGGTCCTGGCAATGCCATCGGCTTTCAATATTATAATGGACCAAAAGTTACTCTATTAGCTTCCAAAACTTCGCGTGAGTACATTAGAACAGAAGAATAACCTTCAATATCGAGAGAGTTACTATCACTAACTATCCTATG"]
```
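
As a quick sanity check before running the model (an illustrative helper, not part of the official API), you can verify that the number of `<DNA>` placeholders in the English sequence matches the number of DNA sequences, as the comment above requires:

```python
def check_inputs(english_sequence: str, dna_sequences: list) -> None:
    # ChatNT expects exactly one DNA sequence per <DNA> placeholder in the prompt
    n_placeholders = english_sequence.count("<DNA>")
    if n_placeholders != len(dna_sequences):
        raise ValueError(
            f"{n_placeholders} <DNA> placeholder(s) but {len(dna_sequences)} sequence(s)"
        )

check_inputs("USER: Is there any evidence of an acceptor splice site in this sequence <DNA> ?", ["ATCGGA"])
```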