Update README.md
README.md
CHANGED
@@ -30,6 +30,7 @@ an autoregressive fashion, using low‑temperature sampling to produce classific
 ### Training Data
 ChatNT was instruction‑tuned on a unified corpus covering 27 diverse tasks from DNA, RNA and proteins, spanning multiple species, tissues and biological processes.
 This amounted to 605 million DNA tokens (≈ 3.6 billion bases) and 273 million English tokens, sampled uniformly over tasks for 2 billion instruction tokens.
+Examples of questions and sequences for each task, as well as additional task information, can be found in [Datasets_overview.csv](Datasets_overview.csv).
 
 ### Tokenization
 DNA inputs are broken into overlapping 6‑mer tokens and padded or truncated to 2048 tokens (~ 12 kb). English prompts and