marisming commited on
Commit
e92fbc8
·
1 Parent(s): 13b734f
Files changed (1) hide show
  1. README.md +59 -0
README.md CHANGED
@@ -1,3 +1,62 @@
1
  ---
2
  license: apache-2.0
3
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  license: apache-2.0
3
  ---
4
+ DNAGPT2- The Best Beginner's Guide to Gene Sequence Large Language Models
5
+
6
+ ### 1. Overview
7
+ Large language models have long transcended the NLP research domain, becoming a cornerstone for AI in science. Gene sequences in bioinformatics are most similar to natural language, making the application of large models to biological sequence studies a hot research direction in recent years. The 2024 Nobel Prize in Chemistry awarded to AlphaFold for predicting protein structures has further illuminated the future path for biological research.
8
+
9
+ However, for most biologists, large models remain unfamiliar territory. Until 2023, models like GPT were niche topics within NLP research, only gaining public attention due to the emergence of ChatGPT.
10
+
11
+ Most biology + large model research has emerged post-2023, but the significant interdisciplinary gap means these studies are typically collaborative efforts by large companies and teams. Replicating or learning from this work is challenging for many researchers, as evidenced by the issues sections of top papers on GitHub.
12
+
13
+ On one hand, large models are almost certain to shape the future of biological research; on the other, many researchers hesitate at the threshold of large models. Providing a bridge over this gap is thus an urgent need.
14
+
15
+ DNAGTP2 serves as this bridge, aiming to facilitate more biologists in overcoming the large model barrier and leveraging these powerful tools to advance their work.
16
+
17
+ ### 2. Tutorial Characteristics
18
+ This tutorial is characterized by:
19
+
20
+ 1. **Simplicity**: Simple code entirely built using Hugging Face’s standard libraries.
21
+ 2. **Simplicity**: Basic theoretical explanations with full visual aids.
22
+ 3. **Simplicity**: Classic paper cases that are easy to understand.
23
+
24
+ Despite its simplicity, the tutorial covers comprehensive content, from building tokenizers to constructing GPT, BERT models from scratch, fine-tuning LLaMA models, basic DeepSpeed multi-GPU distributed training, and applying SOTA models like LucaOne and ESM3. It combines typical biological tasks such as sequence classification, structure prediction, and regression analysis, progressively unfolding.
25
+
26
+ ### Target Audience:
27
+ 1. Researchers and students in the field of biology, especially bioinformatics.
28
+ 2. Beginners in large model learning, applicable beyond just biology.
29
+
30
+ ### 3. Tutorial Outline
31
+ #### 1 Data and Environment
32
+ 1.1 Introduction to Large Model Runtime Environments
33
+ 1.2 Pre-trained and Fine-tuning Data Related to Genes
34
+ 1.3 Basic Usage of Datasets Library
35
+
36
+ #### 2 Building DNA GPT2/Bert Large Models from Scratch
37
+ 2.1 Building DNA Tokenizer
38
+ 2.2 Training DNA GPT2 Model from Scratch
39
+ 2.3 Training DNA Bert Model from Scratch
40
+ 2.4 Feature Extraction for Biological Sequences Using Gene Large Models
41
+ 2.5 Building Large Models Based on Multimodal Data
42
+
43
+ #### 3 Biological Sequence Tasks Using Gene Large Models
44
+ 3.1 Sequence Classification Task
45
+ 3.2 Structure Prediction Task
46
+ 3.3 Multi-sequence Interaction Analysis
47
+ 3.4 Function Prediction Task
48
+ 3.5 Regression Tasks
49
+
50
+ #### 4 Entering the ChatGPT Era: Gene Instruction Building and Fine-tuning
51
+ 4.1 Expanding LLaMA Vocabulary Based on Gene Data
52
+ 4.2 Introduction to DeepSpeed Distributed Training
53
+ 4.3 Continuous Pre-training of LLaMA Model Based on Gene Data
54
+ 4.4 Classification Task Using LLaMA-gene Large Model
55
+ 4.5 Instruction Fine-tuning Based on LLaMA-gene Large Model
56
+
57
+ #### 5 Overview of SOTA Large Model Applications in Biology
58
+ 5.1 Application of DNABERT2
59
+ 5.2 Usage of LucaOne
60
+ 5.3 Usage of ESM3
61
+ 5.4 Application of MedGPT
62
+ 5.5 Application of LLaMA-gene