[ChatNT](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1) is the first multimodal conversational agent designed with a deep understanding of biological sequences (DNA, RNA, proteins).
It enables users, even those with no coding background, to interact with biological data through natural language, and it generalizes across multiple biological tasks and modalities.

**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### Model Sources

- **Paper:** [ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1.full.pdf)

### Architecture and Parameters

ChatNT is built on a three-module design: a 500M-parameter [Nucleotide Transformer v2](https://www.nature.com/articles/s41592-024-02523-z) DNA encoder pre-trained on genomes from 850 species and handling up to 12 kb per sequence (Dalla-Torre et al., 2024); an English-aware Perceiver Resampler that linearly projects the 2048 DNA-token embeddings and compresses them through gated cross-attention into 64 task-conditioned vectors (REF); and a frozen [Vicuna-7B](https://lmsys.org/blog/2023-03-30-vicuna/) decoder.

Users provide a natural-language prompt containing one or more `<DNA>` placeholders together with the corresponding DNA sequences (tokenized as 6-mers). The projection layer inserts the 64 resampled DNA embeddings at each placeholder, and the Vicuna decoder then generates free-form English responses autoregressively, using low-temperature sampling to produce classification labels, multi-label statements, or numeric values.
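The placeholder mechanism can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the actual ChatNT code: the function name, the toy dimensions, and the use of pre-computed arrays are assumptions; in the real model the 64 vectors per sequence come from the Perceiver Resampler and the splice happens in embedding space inside the forward pass.

```python
import numpy as np

def splice_dna_embeddings(text_embeds, placeholder_positions, dna_embeds):
    """Insert one block of 64 resampled DNA vectors at each <DNA> placeholder.

    text_embeds: (seq_len, dim) English token embeddings.
    placeholder_positions: indices of the <DNA> tokens, in order.
    dna_embeds: (num_sequences, 64, dim) resampled DNA embeddings.
    """
    assert len(placeholder_positions) == dna_embeds.shape[0]
    pieces, prev = [], 0
    for pos, block in zip(placeholder_positions, dna_embeds):
        pieces.append(text_embeds[prev:pos])  # text before the placeholder
        pieces.append(block)                  # 64 DNA vectors replace the token
        prev = pos + 1                        # skip the placeholder itself
    pieces.append(text_embeds[prev:])         # remaining text
    return np.concatenate(pieces, axis=0)

# Toy example: 10 text tokens, one <DNA> placeholder at position 4.
merged = splice_dna_embeddings(np.zeros((10, 8)), [4], np.ones((1, 64, 8)))
print(merged.shape)  # (73, 8): 10 text tokens - 1 placeholder + 64 DNA vectors
```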

### Training Data

ChatNT was instruction-tuned on a unified corpus covering 27 diverse tasks from DNA, RNA and proteins, spanning multiple species, tissues and biological processes. This amounted to 605 million DNA tokens (≈ 3.6 billion bases) and 273 million English tokens, sampled uniformly over tasks for a total of 2 billion instruction tokens.

### Tokenization

DNA inputs are broken into non-overlapping 6-mer tokens and padded or truncated to 2048 tokens (~12 kb). English prompts and outputs use the LLaMA tokenizer, augmented with `<DNA>` as a special token to mark sequence insertion points.
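As an illustration of this scheme, the following is a simplified sketch, not the actual ChatNT tokenizer (in practice, use the `AutoTokenizer` loaded from the `bio_tokenizer` subfolder, which also handles special tokens and any trailing partial k-mer):

```python
def tokenize_6mers(seq, max_tokens=2048, pad_token="<pad>"):
    """Split a DNA sequence into non-overlapping 6-mers, then pad or truncate to max_tokens."""
    kmers = [seq[i:i + 6] for i in range(0, len(seq) - 5, 6)]  # drop trailing partial k-mer
    kmers = kmers[:max_tokens]
    return kmers + [pad_token] * (max_tokens - len(kmers))

tokens = tokenize_6mers("ATCGGAAAAAGATCCAGAAA")  # 20 bases -> 3 full 6-mers + padding
print(tokens[:4])   # ['ATCGGA', 'AAAAGA', 'TCCAGA', '<pad>']
print(len(tokens))  # 2048
```

At 6 bases per token, the 2048-token limit corresponds to 2048 × 6 ≈ 12 kb of sequence, matching the encoder's context size.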

### Credit and License

The DNA encoder is the Nucleotide Transformer v2 ([Dalla-Torre et al., 2024](https://www.nature.com/articles/s41592-024-02523-z)), and the English decoder is Vicuna-7B ([Chiang et al., 2023](https://lmsys.org/blog/2023-03-30-vicuna/)). All code and model artifacts are released under ???.

### Limitations and Disclaimer

While ChatNT excels at conversational molecular-phenotype tasks, it is **not** a clinical or diagnostic tool. It can produce incorrect or "hallucinated" answers, particularly on out-of-distribution inputs, and its numeric predictions may suffer digit-level errors. Confidence estimates require post-hoc calibration. Users should always validate critical outputs against experiments or specialized bioinformatics pipelines.

## How to use

Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models. PyTorch should also be installed.

```
pip install git+https://github.com/huggingface/transformers
pip install torch
```

A small snippet of code is given here in order to **generate ChatNT answers from a pipeline (high-level)**.
- The prompt used for training ChatNT is already incorporated inside the pipeline and is the following:
"A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions."

```
# Load the English and DNA tokenizers
from transformers import AutoTokenizer

english_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="english_tokenizer")
bio_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="bio_tokenizer")

# Define custom inputs (note that the number of <DNA> tokens in the english sequence must be equal to len(dna_sequences))
# Here the english sequence should include the prompt
english_sequence = "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Is there any evidence of an acceptor splice site in this sequence <DNA> ?"
dna_sequences = ["ATCGGAAAAAGATCCAGAAAGTTATACCAGGCCAATGGGAATCACCTATTACGTGGATAATAGCGATAGTATGTTACCTATAAATTTAACTACGTGGATATCAGGCAGTTACGTTACCAGTCAAGGAGCACCCAAAACTGTCCAGCAACAAGTTAATTTACCCATGAAGATGTACTGCAAGCCTTGCCAACCAGTTAAAGTAGCTACTCATAAGGTAATAAACAGTAATATCGACTTTTTATCCATTTTGATAATTGATTTATAACAGTCTATAACTGATCGCTCTACATAATCTCTATCAGATTACTATTGACACAAACAGAAACCCCGTTAATTTGTATGATATATTTCCCGGTAAGCTTCGATTTTTAATCCTATCGTGACAATTTGGAATGTAACTTATTTCGTATAGGATAAACTAATTTACACGTTTGAATTCCTAGAATATGGAGAATCTAAAGGTCCTGGCAATGCCATCGGCTTTCAATATTATAATGGACCAAAAGTTACTCTATTAGCTTCCAAAACTTCGCGTGAGTACATTAGAACAGAAGAATAACCTTCAATATCGAGAGAGTTACTATCACTAACTATCCTATG"]
```
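Since a mismatch between placeholders and sequences is an easy mistake, a quick sanity check can be run before calling the model. This is an illustrative helper, not part of the ChatNT API; the function name is an assumption.

```python
def check_inputs(english_sequence, dna_sequences):
    """Verify that each <DNA> placeholder has exactly one matching DNA sequence."""
    n_placeholders = english_sequence.count("<DNA>")
    if n_placeholders != len(dna_sequences):
        raise ValueError(
            f"Found {n_placeholders} <DNA> placeholder(s) "
            f"but {len(dna_sequences)} DNA sequence(s)"
        )
    return n_placeholders

# One placeholder, one sequence: passes and returns 1.
print(check_inputs("USER: Is there an acceptor splice site in <DNA> ?", ["ATCGGA"]))  # 1
```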