marisming committed on
Commit
83f9751
·
verified ·
1 Parent(s): 9b90d8f

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.psd filter=lfs diff=lfs merge=lfs -text
37
+ *.txt filter=lfs diff=lfs merge=lfs -text
.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,62 @@
1
+ ---
2
+ license: apache-2.0
3
+ ---
4
+ DNAGPT2: The Best Beginner's Guide to Gene Sequence Large Language Models
5
+
6
+ ### 1. Overview
7
+ Large language models have long since moved beyond NLP research and are becoming a cornerstone of AI for science. Of all the data in bioinformatics, gene sequences most closely resemble natural language, which has made applying large models to biological sequences a hot research direction in recent years. The 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold's protein structure prediction, has further pointed the way for biological research.
8
+
9
+ However, for most biologists, large models remain unfamiliar territory. Until 2023, models like GPT were niche topics within NLP research, only gaining public attention due to the emergence of ChatGPT.
10
+
11
+ Most research combining biology and large models has appeared since 2023, but the interdisciplinary gap is wide, so these studies are typically collaborative efforts by large companies and teams. Reproducing or learning from this work is hard for many researchers, as the GitHub issue trackers of top papers show.
12
+
13
+ On one hand, large models are almost certain to shape the future of biological research; on the other, many researchers hesitate at the threshold of large models. Providing a bridge over this gap is thus an urgent need.
14
+
15
+ DNAGPT2 serves as this bridge, aiming to help more biologists get past the large-model barrier and use these powerful tools to advance their work.
16
+
17
+ ### 2. Tutorial Characteristics
18
+ This tutorial is characterized by:
19
+
20
+ 1. **Simplicity**: Simple code entirely built using Hugging Face’s standard libraries.
21
+ 2. **Simplicity**: Basic theoretical explanations with full visual aids.
22
+ 3. **Simplicity**: Classic paper cases that are easy to understand.
23
+
24
+ Despite its simplicity, the tutorial is comprehensive: it moves from building tokenizers, to constructing GPT and BERT models from scratch, to fine-tuning LLaMA models and basic DeepSpeed multi-GPU distributed training, and finally to applying SOTA models such as LucaOne and ESM3. Along the way it works through typical biological tasks such as sequence classification, structure prediction, and regression analysis, building up step by step.
25
+
26
+ ### Target Audience:
27
+ 1. Researchers and students in the field of biology, especially bioinformatics.
28
+ 2. Beginners in large model learning, applicable beyond just biology.
29
+
30
+ ### 3. Tutorial Outline
31
+ #### 1 Data and Environment
32
+ 1.1 Introduction to Large Model Runtime Environments
33
+ 1.2 Pre-trained and Fine-tuning Data Related to Genes
34
+ 1.3 Basic Usage of Datasets Library
35
+
36
+ #### 2 Building DNA GPT2/BERT Large Models from Scratch
37
+ 2.1 Building DNA Tokenizer
38
+ 2.2 Training DNA GPT2 Model from Scratch
39
+ 2.3 Training DNA BERT Model from Scratch
40
+ 2.4 Feature Extraction for Biological Sequences Using Gene Large Models
41
+ 2.5 Building Large Models Based on Multimodal Data
42
+
43
+ #### 3 Biological Sequence Tasks Using Gene Large Models
44
+ 3.1 Sequence Classification Task
45
+ 3.2 Structure Prediction Task
46
+ 3.3 Multi-sequence Interaction Analysis
47
+ 3.4 Function Prediction Task
48
+ 3.5 Regression Tasks
49
+
50
+ #### 4 Entering the ChatGPT Era: Gene Instruction Building and Fine-tuning
51
+ 4.1 Expanding LLaMA Vocabulary Based on Gene Data
52
+ 4.2 Introduction to DeepSpeed Distributed Training
53
+ 4.3 Continuous Pre-training of LLaMA Model Based on Gene Data
54
+ 4.4 Classification Task Using LLaMA-gene Large Model
55
+ 4.5 Instruction Fine-tuning Based on LLaMA-gene Large Model
56
+
57
+ #### 5 Overview of SOTA Large Model Applications in Biology
58
+ 5.1 Application of DNABERT2
59
+ 5.2 Usage of LucaOne
60
+ 5.3 Usage of ESM3
61
+ 5.4 Application of MedGPT
62
+ 5.5 Application of LLaMA-gene
.ipynb_checkpoints/env_ini-checkpoint.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zones and other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set the proxy environment variables (AutoDL general regions)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ " if '=' in line:\n",
178
+ " var, value = line.split('=', 1)\n",
179
+ " os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# load ~59k samples from the core promoter prediction dataset\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
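The `DatasetDict` printed in the notebook above reports 53276 training rows and 5920 test rows after `train_test_split(test_size=0.1)`. The sketch below reproduces those counts with plain arithmetic. It assumes the split sizes follow the scikit-learn convention (test rows rounded up, the remainder going to train); `split_sizes` is a hypothetical helper written for illustration, not part of the `datasets` API.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    takes whatever is left, as in scikit-learn's shuffle split.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Total rows taken from the notebook output: 53276 train + 5920 test.
total_rows = 53276 + 5920
print(total_rows)                    # 59196
print(split_sizes(total_rows, 0.1))  # (53276, 5920)
```

The total of 59196 rows is what the split operates on, so a 10% hold-out of 5920 rows is consistent with the output shown in the notebook.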
.ipynb_checkpoints/lecture_intro-checkpoint.ipynb ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "id": "1e2c218d-3768-4fb6-9ccd-e7072558a3fa",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": []
10
+ }
11
+ ],
12
+ "metadata": {
13
+ "kernelspec": {
14
+ "display_name": "Python 3 (ipykernel)",
15
+ "language": "python",
16
+ "name": "python3"
17
+ },
18
+ "language_info": {
19
+ "codemirror_mode": {
20
+ "name": "ipython",
21
+ "version": 3
22
+ },
23
+ "file_extension": ".py",
24
+ "mimetype": "text/x-python",
25
+ "name": "python",
26
+ "nbconvert_exporter": "python",
27
+ "pygments_lexer": "ipython3",
28
+ "version": "3.12.3"
29
+ }
30
+ },
31
+ "nbformat": 4,
32
+ "nbformat_minor": 5
33
+ }
.ipynb_checkpoints/lecture_intro_cn-checkpoint.ipynb ADDED
@@ -0,0 +1,137 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "2365faf7-39fb-4e53-a810-2e28c4f6b4c1",
6
+ "metadata": {},
7
+ "source": [
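8
+ "# DNAGPT2 - The Best Beginner's Guide to Gene Sequence Large Models 1\n",
9
+ "\n",
10
+ "## 1 Overview\n",
11
+ "Natural-language large models have long since moved beyond NLP research and are becoming a cornerstone of AI for science. Gene sequences are the bioinformatics data most similar to natural language, so applying large models to biological sequences has become a hot research direction over the past year or two. In particular, the 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold's protein structure prediction, has pointed the way forward for biological research.\n",
12
+ "\n",
13
+ "For most people working in biology, however, large models are still unfamiliar. In fact, before 2023, large models such as GPT were a niche topic within NLP research, and they only came to public attention with the explosive rise of ChatGPT.\n",
14
+ "\n",
15
+ "Most research combining biology and large models has also appeared since 2023. Because the fields are so far apart, these papers are usually the product of large companies and large collaborating teams, and most researchers find it very hard to learn from or reproduce them, as can be seen in the GitHub issues of many top papers.\n",
16
+ "\n",
17
+ "On one hand, large models are almost certainly the future of biological research; on the other, many biology researchers hesitate at the threshold. Putting a ladder in front of that threshold has become an urgent need in the field.\n",
18
+ "\n",
19
+ "DNAGPT2 is that ladder. We hope it serves as a starting point that helps more biologists get over the large-model threshold, put large models to work, and stay ahead of their peers.\n",
20
+ "\n",
21
+ "## 2 Tutorial Characteristics\n",
22
+ "This tutorial has the following characteristics:\n",
23
+ "\n",
24
+ "1 Simple. The code is simple: all of it is built with Hugging Face's standard libraries and can be picked up in one read.\n",
25
+ "\n",
26
+ "2 Simple. The theory is simple: only the most basic network architectures are covered, all explained visually.\n",
27
+ "\n",
28
+ "3 Simple. The cases are simple: each is a representative example from a classic paper and is easy to follow.\n",
29
+ "\n",
30
+ "\n",
31
+ "\n",
32
+ "The content itself, however, is far from trivial: it runs from building basic tokenizers, to building typical models such as GPT and BERT from scratch, to fine-tuning LLaMA models and basic DeepSpeed multi-GPU distributed training, to applying SOTA large models such as LucaOne and ESM3, combined with typical biological tasks such as sequence classification, structure prediction, and regression analysis, building up step by step. The tutorial will keep pace with research trends and be updated continuously.\n",
33
+ "\n",
34
+ "\n",
35
+ "\n",
36
+ "Target audience:\n",
37
+ "\n",
38
+ "1 Researchers, students, and others in biology, especially bioinformatics.\n",
39
+ "\n",
40
+ "2 Beginners in large models. The tutorial is not only for biologists: it is much like any general introduction to large models, just with different data.\n",
41
+ "\n",
42
+ "## 3 Tutorial Outline\n",
43
+ "1 Data and Environment\n",
44
+ "\n",
45
+ "1.1 Introduction to Large Model Runtime Environments\n",
46
+ "\n",
47
+ "1.2 Pre-training and Fine-tuning Data Related to Genes\n",
48
+ "\n",
49
+ "1.3 Basic Usage of the datasets Library\n",
50
+ "\n",
51
+ "2 Building DNA GPT2/BERT Large Models from Scratch\n",
52
+ "\n",
53
+ "\n",
54
+ "2.1 Building a DNA Tokenizer\n",
55
+ "\n",
56
+ "2.2 Training a DNA GPT2 Model from Scratch\n",
57
+ "\n",
58
+ "2.3 Training a DNA BERT Model from Scratch\n",
59
+ "\n",
60
+ "2.4 Feature Extraction for Biological Sequences Using Gene Large Models\n",
61
+ "\n",
62
+ "2.5 Building Large Models Based on Multimodal Data\n",
63
+ "\n",
64
+ "\n",
65
+ "\n",
66
+ "3 Biological Sequence Tasks Using Gene Large Models\n",
67
+ "\n",
68
+ "3.1 Sequence Classification Task\n",
69
+ "\n",
70
+ "3.2 Sequence Structure Prediction\n",
71
+ "\n",
72
+ "3.3 Multi-sequence Interaction Analysis\n",
73
+ "\n",
74
+ "3.4 Function Prediction Task\n",
75
+ "\n",
76
+ "3.5 Regression Tasks\n",
77
+ "\n",
78
+ "\n",
79
+ "\n",
80
+ "4 Entering the ChatGPT Era: Gene Instruction Building and Fine-tuning\n",
81
+ "\n",
82
+ "4.1 Expanding the LLaMA Vocabulary Based on Gene Data\n",
83
+ "\n",
84
+ "4.2 Introduction to DeepSpeed Distributed Training\n",
85
+ "\n",
86
+ "4.3 Continued Pre-training of the LLaMA Model Based on Gene Data\n",
87
+ "\n",
88
+ "4.4 Classification Task Using the llama-gene Large Model\n",
89
+ "\n",
90
+ "4.5 Instruction Fine-tuning Based on the llama-gene Large Model\n",
91
+ "\n",
92
+ "\n",
93
+ "\n",
94
+ "5 Overview of SOTA Large Model Applications in Biology\n",
95
+ "\n",
96
+ "5.1 Application of dnabert2\n",
97
+ "\n",
98
+ "5.2 Usage of lucaone\n",
99
+ "\n",
100
+ "5.3 Usage of ESM3\n",
101
+ "\n",
102
+ "5.4 Application of Medgpt\n",
103
+ "\n",
104
+ "5.5 Application of llama-gene"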
105
+ ]
106
+ },
107
+ {
108
+ "cell_type": "code",
109
+ "execution_count": null,
110
+ "id": "3252ef0f-3193-43f3-9dcf-5d2b625dbdf7",
111
+ "metadata": {},
112
+ "outputs": [],
113
+ "source": []
114
+ }
115
+ ],
116
+ "metadata": {
117
+ "kernelspec": {
118
+ "display_name": "Python 3 (ipykernel)",
119
+ "language": "python",
120
+ "name": "python3"
121
+ },
122
+ "language_info": {
123
+ "codemirror_mode": {
124
+ "name": "ipython",
125
+ "version": 3
126
+ },
127
+ "file_extension": ".py",
128
+ "mimetype": "text/x-python",
129
+ "name": "python",
130
+ "nbconvert_exporter": "python",
131
+ "pygments_lexer": "ipython3",
132
+ "version": "3.12.3"
133
+ }
134
+ },
135
+ "nbformat": 4,
136
+ "nbformat_minor": 5
137
+ }
01-data_env/.ipynb_checkpoints/1-env-intro-checkpoint.ipynb ADDED
@@ -0,0 +1,79 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "a25f3d36-e14f-4afd-8926-32748a42e1d1",
6
+ "metadata": {},
7
+ "source": [
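8
+ "# 1 Introduction to Large Model Runtime Environments\n",
9
+ "\n",
10
+ "\n",
11
+ "We recommend using a hosted environment such as AutoDL or Google Colab.\n",
12
+ "\n",
13
+ "GPU: RTX 4090 or 4090D\n",
14
+ "\n",
15
+ "Memory: at least 32 GB\n",
16
+ "\n",
17
+ "torch>=2.3.0\n",
18
+ "\n",
19
+ "For details, see: https://zhuanlan.zhihu.com/p/13479003076\n",
20
+ "\n",
21
+ "Then install the basic transformers environment below with pip:"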
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "execution_count": null,
27
+ "id": "cdeae2e5-2a39-4370-a5ec-47780f8fa76a",
28
+ "metadata": {},
29
+ "outputs": [],
30
+ "source": [
31
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
32
+ ]
33
+ },
34
+ {
35
+ "cell_type": "markdown",
36
+ "id": "a355a6e6-62fc-4b8f-ba35-b9c2f0ef48c8",
37
+ "metadata": {},
38
+ "source": [
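39
+ "To run DeepSpeed, a single machine with multiple GPUs is usually enough; this tutorial generally does not cover cases that need multiple machines.\n",
40
+ "\n",
41
+ "\n",
42
+ "Recommended GPU hosts:\n",
43
+ "* autodl.com (mainland China)\n",
44
+ "* vast.ai (outside China)\n",
45
+ "\n",
46
+ "GPUs on mainstream cloud platforms are usually very expensive, and those platforms do not allow consumer cards such as the 4090."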
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "id": "813c37df-9fef-453a-bfbd-46dd05ac76dd",
53
+ "metadata": {},
54
+ "outputs": [],
55
+ "source": []
56
+ }
57
+ ],
58
+ "metadata": {
59
+ "kernelspec": {
60
+ "display_name": "Python 3 (ipykernel)",
61
+ "language": "python",
62
+ "name": "python3"
63
+ },
64
+ "language_info": {
65
+ "codemirror_mode": {
66
+ "name": "ipython",
67
+ "version": 3
68
+ },
69
+ "file_extension": ".py",
70
+ "mimetype": "text/x-python",
71
+ "name": "python",
72
+ "nbconvert_exporter": "python",
73
+ "pygments_lexer": "ipython3",
74
+ "version": "3.12.3"
75
+ }
76
+ },
77
+ "nbformat": 4,
78
+ "nbformat_minor": 5
79
+ }
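The environment notes above ask for `torch>=2.3.0`, and the pip log elsewhere in this commit shows a build string of the form `2.3.0+cu121`. A minimal sketch of checking such a requirement with only the standard library follows. `version_tuple` and `meets_minimum` are hypothetical helper names, and the comparison deliberately ignores local build tags and pre-release suffixes; for anything stricter, the `packaging` library's version parser is the more robust choice.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn a string such as '2.3.0+cu121' into (2, 3, 0).

    Simplification: the local build tag after '+' is dropped, and
    parsing stops at the first component without leading digits.
    """
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum`."""
    a, b = version_tuple(installed), version_tuple(minimum)
    # Pad the shorter tuple with zeros so '2.3' compares equal to '2.3.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("2.3.0+cu121", "2.3.0"))  # True
print(meets_minimum("2.2.2", "2.3.0"))        # False
```

In a live environment the first argument would typically be `torch.__version__`; comparing integer tuples rather than raw strings avoids the trap where `"2.10.0"` sorts before `"2.3.0"`.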
01-data_env/.ipynb_checkpoints/2-data-intro-checkpoint.ipynb ADDED
@@ -0,0 +1,283 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "50ff8836-7075-4858-b463-c99f973f408d",
6
+ "metadata": {},
7
+ "source": [
8
+ "# 2 Gene-related pretraining and fine-tuning data"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "17cde5bb-70e5-437e-a4a3-193a881dd412",
14
+ "metadata": {},
15
+ "source": [
16
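+ "This tutorial focuses on gene-related biological sequence data, mainly DNA and protein sequences. The data directory contains the following files:\n",
17
+ "\n",
18
+ "* dna_1g.txt: DNA sequence data, 1 GB, extracted from the glue dataset (see the dnabert2 paper for details); it covers several model organisms\n",
19
+ "* protein_1g.txt: protein sequence data, 1 GB, extracted from the PDB database\n",
20
+ "* english_500m.txt: English data, 500 MB, taken from English encyclopedia text"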
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "markdown",
25
+ "id": "b45ecf27-1514-45e0-bfbd-361e6dcc98ea",
26
+ "metadata": {},
27
+ "source": [
28
+ "Below we demonstrate basic usage of Hugging Face's datasets library, along with some sample data"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": 3,
34
+ "id": "2715f9bb-2e43-4bd6-8715-5c96d317bcf8",
35
+ "metadata": {},
36
+ "outputs": [
37
+ {
38
+ "data": {
39
+ "application/vnd.jupyter.widget-view+json": {
40
+ "model_id": "c067aeb8ab304723ac6b527e7ad6c768",
41
+ "version_major": 2,
42
+ "version_minor": 0
43
+ },
44
+ "text/plain": [
45
+ "Generating train split: 0 examples [00:00, ? examples/s]"
46
+ ]
47
+ },
48
+ "metadata": {},
49
+ "output_type": "display_data"
50
+ },
51
+ {
52
+ "data": {
53
+ "text/plain": [
54
+ "DatasetDict({\n",
55
+ " train: Dataset({\n",
56
+ " features: ['text'],\n",
57
+ " num_rows: 1079595\n",
58
+ " })\n",
59
+ "})"
60
+ ]
61
+ },
62
+ "execution_count": 3,
63
+ "metadata": {},
64
+ "output_type": "execute_result"
65
+ }
66
+ ],
67
+ "source": [
68
+ "# Load the DNA data\n",
69
+ "from datasets import load_dataset\n",
70
+ "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')\n",
71
+ "dna_dataset"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "markdown",
76
+ "id": "ec00ad72-c5f9-40db-8508-6c6bf8f374c1",
77
+ "metadata": {},
78
+ "source": [
79
+ "\n",
80
+ "Datasets provides loading scripts for local and remote datasets. It supports several common data formats, for example:\n",
81
+ "\n",
82
+ "| Data format | Loading script | Example |\n",
83
+ "|-------------------|----------------|-------------------------------------------------------------------------|\n",
84
+ "| CSV & TSV | csv | `load_dataset(\"csv\", data_files=\"my_file.csv\")` |\n",
85
+ "| Text files | text | `load_dataset(\"text\", data_files=\"my_file.txt\")` |\n",
86
+ "| JSON & JSON Lines | json | `load_dataset(\"json\", data_files=\"my_file.jsonl\")` |\n",
87
+ "| Pickled DataFrames| pandas | `load_dataset(\"pandas\", data_files=\"my_dataframe.pkl\")` |\n",
88
+ "\n",
89
+ "As the table shows, for each data format we only need to call load_dataset() and use the data_files argument to give the path to one or more files."
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "markdown",
94
+ "id": "24c40ec7-cb59-4c3a-8052-00d7979f6208",
95
+ "metadata": {},
96
+ "source": [
97
+ "By default load_dataset places the data under the train split; the returned dataset can be used like an ordinary Python dict"
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "code",
102
+ "execution_count": 4,
103
+ "id": "2a375409-d2b6-4648-8f6a-8ac3fb25bb75",
104
+ "metadata": {},
105
+ "outputs": [
106
+ {
107
+ "data": {
108
+ "text/plain": [
109
+ "{'text': 'TTAAATCCTAGAAGTTGGTTACACGGGTGAGGAAAATGGTGAGAAGCCCAATGGGATGCTGTAGCAATGACAGTGAACTGCTGTCACCCCTGAGGCTGGAAAGATAACAGACATTTGCCAGGAGCTAGAAGCTGGGGCAGCCTGGTAGGAGCGAGAATATGGTGAGAGCTGCCCCCTGGGGATGGAACCACAGAGGGAGGGTCTCTCTGATGAGACATAGAGCCAAGAACAGATACAGCCATTGTGGGAGATGGTAACCAAAGCAGAGAGAGAGAGAGAGAGCGAGAGAGAGAGAAAACACCCTGGTTTCTTCCTTCCTTCCACCTTTGAGTTTCCCACCAGTGCTTCCCATTAGCCCAAACTACCAAGAACCCAGAGGGCAAAGGAGCCCGGGAAATCTAATTCTACATGATACCGAGCAAAGCCGATGTTCCAGCTGGCTGCGTCTGTTACAGTAGGTAGTCAGGCAGACATAAGCAGGGCAGGAGAGGGCTCCTCCCAACCAGGAATGTCAGGTGACGGTCAGGTGATGGTCAGGTGGTCATTAACTGTCTCTCTAAAATAATAATTGGTTACAGCCAGCACCAGGGAAAGGCAGTCTCCCAACCGATAGAAACATCTGAAACTGATGATCAGTAGCTTCCCAATAAGGTCTCAGGAGTTGGACGCATGGGCTCAGCATGAACACTGAGAGGCAAAATGGTGGAGTTTAACTGGTATATGACCTTCCTCTAGAAACATTCAGCTGGTAAGGGAAGAACGCCTTAAGCGAATATGCACGCAACTCCAGTAAACACTGTGCATGTTCCTGTCCCAATGCTGGTAGACCACTGCGCATGCAAACAGCCCACCCCAGGGAAGAATCAGGAGAGAAGAGACCCCACAAGCATGCCAACACATAAAACCCCAAGTCAGGAGTCAAACCATGCACTTGAATCAAGTCACCCACTTAGCTCTCTTTCAAGTGTATTTTACTTTCTTTCATTCCTGCTCTAAAACT'}"
110
+ ]
111
+ },
112
+ "execution_count": 4,
113
+ "metadata": {},
114
+ "output_type": "execute_result"
115
+ }
116
+ ],
117
+ "source": [
118
+ "dna_dataset[\"train\"][0]"
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "raw",
123
+ "id": "985bd82a-1ff0-49ef-968d-8d5f6df8d76f",
124
+ "metadata": {},
125
+ "source": [
126
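+ "As shown above, DNA data is text made up of the four letters A, T, C and G. For the purpose of learning large language models you can ignore what it actually means; in any case, the meaning of most DNA sequences has not been deciphered yet :)\n",
127
+ "\n",
128
+ "Next comes the protein sequence"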
129
+ ]
130
+ },
131
+ {
132
+ "cell_type": "code",
133
+ "execution_count": 5,
134
+ "id": "94e3f443-939e-4148-bba6-6cafa90790b6",
135
+ "metadata": {},
136
+ "outputs": [
137
+ {
138
+ "data": {
139
+ "application/vnd.jupyter.widget-view+json": {
140
+ "model_id": "a1023bd5311a4a5dbe96c6c8fdc5b519",
141
+ "version_major": 2,
142
+ "version_minor": 0
143
+ },
144
+ "text/plain": [
145
+ "Generating train split: 0 examples [00:00, ? examples/s]"
146
+ ]
147
+ },
148
+ "metadata": {},
149
+ "output_type": "display_data"
150
+ },
151
+ {
152
+ "data": {
153
+ "text/plain": [
154
+ "{'text': 'MLTDPFGRTIKLRIAVTRCLCIYCHREGESDPGTEMSAERIAEIAKAFYELGIKKLKLTGGEPLLRKDICEIISMMPDFEEISLTTGILLSDLAFDLKESGLDRVISLDTLDAETFRFITGGGELSRVLEGLRMAVEAKLTPIKLMVLMSGLESEVRKMLEFASFEETVILQLIELIPSRTGKFYLDPTIFEKDFERVAKAVKIRDMHRRKQFITPFGVVEIVKPLDTEFCMHCRIRITSDGRIKLCLMSDETVDISELSGDELKKAIFEAVKRRKPFFIMKGEILALISAVLWGFAPILDRYALLSGAPIYAALAIRAFGALIAMLFILSVLRGGLAVEAKAAVLLLIAGAIGGALAMVFYYLALESVGASRTVPITAIYPMFTALFSFLLLSEPLSPKTIAGIAFIVLGVILVSEGMVKLRGEDVVIRKYDHSMDRDKLIEMYVYDPRFRCLGLPPLSKEAIKGWIDYLGQGFAIIAEKDGKIVGHLVIVPGEREVDLTIFIHQDYQLGLGQEMMKLIIDFCRKAGFAITLVTERTARAIHVYRKLGFEIVAPYYEYDMRLQLKMIVPKGKTVLIKGTASIRGECEVLGARLFFESEKFVPVFCLEDCEIEVGEFKILDGSTIPESWEKLSKMDWETVFLYGGVDSGKSTLATYLAKVGGAYVLDLDIGQADVAPGAMGYGFAKDVVSLSKVSMIGFFVGSITPQGREAKCLRGVARLWKELRKLDGRKIIDTTGWVRGRRAKEYKLAKLEIIEPDLIASFEGKLFDWKTFEVEKGYVIRRDKDRAKARFESYRKFLDGAKTFELERDGIKLKPDFFKGKDVSQFIESVLGTRVVFARLGEEHLTICTKEDCPEYEILRELKELYEVDDIFLFSESEARFVAGLYRGKKYLGIGLIKSIDRILLECTQSDFDTIEIGEIRLEDGRECFIKRFMARIAYSYKPQDETRAARAMGYEVPISFKHAMEICRVLKGKKVPQAISFLEEVVQLKVPVPFRKHKKKVAHKIPGWYAGRYPQKAAEILKVLKLKAAEYKGLKAEELIIVHAQAKK'}"
155
+ ]
156
+ },
157
+ "execution_count": 5,
158
+ "metadata": {},
159
+ "output_type": "execute_result"
160
+ }
161
+ ],
162
+ "source": [
163
+ "protein_dataset = load_dataset('text', data_files='data/protein_1g.txt')\n",
164
+ "protein_dataset[\"train\"][0]"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "markdown",
169
+ "id": "ecaa8216-7b9f-4ba0-af8e-c7c868dc7ec9",
170
+ "metadata": {},
171
+ "source": [
172
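+ "A protein sequence is text made up of 20 letters (amino acids) such as M, L, T, D and P. Proteins are currently understood far better than DNA.\n",
173
+ "\n",
174
+ "Then comes the English text, which is easier to read"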
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "code",
179
+ "execution_count": 9,
180
+ "id": "7521f8ea-fd70-4f5b-aeeb-7ff81635320d",
181
+ "metadata": {},
182
+ "outputs": [
183
+ {
184
+ "data": {
185
+ "text/plain": [
186
+ "{'text': ' \" There \\'s Got to Be a Way \" is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as the fifth and final single from the album in the United Kingdom . It was one of four songs Carey wrote with Ric Wake during their first recording session together , but \" There \\'s Got to Be a Way \" was the only composition to make the final track listing . It is a socio @-@ political conscious R & B @-@ pop song which addresses the existence of poverty , racism and war in the world which gradually becomes more aspirational and positive as it progresses . The track garnered a mixed reception upon the album \\'s release in 1990 . While Carey \\'s vocals were praised , it was seen as too political . An accompanying music video highlights social injustices . The song reached number 54 on the UK Singles Chart . '}"
187
+ ]
188
+ },
189
+ "execution_count": 9,
190
+ "metadata": {},
191
+ "output_type": "execute_result"
192
+ }
193
+ ],
194
+ "source": [
195
+ "english_dataset = load_dataset('text', data_files='data/english_500m.txt')\n",
196
+ "english_dataset[\"train\"][301]"
197
+ ]
198
+ },
199
+ {
200
+ "cell_type": "markdown",
201
+ "id": "5fcad08d-6389-453e-997f-eb2877a5fbbb",
202
+ "metadata": {},
203
+ "source": [
204
+ "English text is made up of 26 letters. Note that English includes spaces, while biological sequences have no explicit spaces"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "markdown",
209
+ "id": "5e4e1e85-a187-469d-9950-1c6cbb9c41f7",
210
+ "metadata": {},
211
+ "source": [
212
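+ "The datasets above are plain text and are generally used as pretraining data. Downstream fine-tuning tasks such as classification usually include labels and are mostly stored as JSON or CSV; here is an example:"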
213
+ ]
214
+ },
215
+ {
216
+ "cell_type": "code",
217
+ "execution_count": 11,
218
+ "id": "c48dd04e-af42-4222-94d5-56a8e08e2cbf",
219
+ "metadata": {},
220
+ "outputs": [
221
+ {
222
+ "data": {
223
+ "application/vnd.jupyter.widget-view+json": {
224
+ "model_id": "7c611d1ab3bb408394196e7929d8e0c5",
225
+ "version_major": 2,
226
+ "version_minor": 0
227
+ },
228
+ "text/plain": [
229
+ "Generating train split: 0 examples [00:00, ? examples/s]"
230
+ ]
231
+ },
232
+ "metadata": {},
233
+ "output_type": "display_data"
234
+ },
235
+ {
236
+ "data": {
237
+ "text/plain": [
238
+ "{'sentence1': 'ATGGAGGAAAATCAGACCATGGTCACAGAGTTCGTCCTGCTGGGATTCTGTCTTGGCCCGAGGATTCACCTAGTTCTTTTTCTGCTTTTCTCTCTCTTCTATACTCTCACCATACTGGGGAATGGGACTATCCTTGCAATGATCTGCCTGGACTCCAGACTCCACACTCCCATGTACTTCTTCCTGTCCCACCTGGCCATTGTCGATATGGCCTATGCCTGCAACACAGTGCCTCAGACACTCATAAACCTCTTGGATGAGACCAGGCCCATCACCTTTGCTGGATGCATGACACAGACCTTTCTCTTCTTGGCTTTTGCCCACACTGAATGTGTGCTCCTTGTTGTGATGTCCTATGACCGGTATGTAGCTATCTGCCACCCGCTACACTACACTGTCATCATGAACTGGAGAGTGTGTACCATTCTGGCTGCTGTTTCCTGGATATTTAGCTTTCTCCTTGCTCTGGTCCATTTAGTTCTCATCCTGAGGCTGCCCTTCTGTGGACCTCATGAAATCAATCACTTCTTCTGTGAAATCCTGTCTGTCCTCAAGCTGGCCTGTGCTGACACAACACTCAATCAGGTCGTTATCTTTGCAGCTTGTGTGTTCATATTAGTGGCCCCCCTATGCTTTGTACTAGTCTCCTACACACGCATCCTGGTGGCCATCCTGAGGATCCAGTCAGGGGAGGGACGCAGAAAGGCCTTCTCTACCTGTTCCTCCCACCTCTGTGTGGTAGGGCTCTTCTTTGGCAGTGCCATTGTCATGTACATGGCCCCCAAGTCCCAGCACCCAGAGGAGCAGCAGAAGGTTCTTTTCCTGTTTTACAGTTTTTTCAACCCCATGCTGAACCCCCTAATCTACAGTCTGAGGAATGCTGAGGTGAAGGGCGCCCTCAAGAGGTCACTGTGCAAAGAAAGTCATTCCTGGTTGGTGTGGTGTTCGGACCATAAATCTTGG',\n",
239
+ " 'sentence2': 'MEENQTMVTEFVLLGFCLGPRIHLVLFLLFSLFYTLTILGNGTILAMICLDSRLHTPMYFFLSHLAIVDMAYACNTVPQTLINLLDETRPITFAGCMTQTFLFLAFAHTECVLLVVMSYDRYVAICHPLHYTVIMNWRVCTILAAVSWIFSFLLALVHLVLILRLPFCGPHEINHFFCEILSVLKLACADTTLNQVVIFAACVFILVAPLCFVLVSYTRILVAILRIQSGEGRRKAFSTCSSHLCVVGLFFGSAIVMYMAPKSQHPEEQQKVLFLFYSFFNPMLNPLIYSLRNAEVKGALKRSLCKESHSWLVWCSDHKSW',\n",
240
+ " 'label': 1}"
241
+ ]
242
+ },
243
+ "execution_count": 11,
244
+ "metadata": {},
245
+ "output_type": "execute_result"
246
+ }
247
+ ],
248
+ "source": [
249
+ "ft_dataset = load_dataset('json', data_files='data/dna_protein_my.json')\n",
250
+ "ft_dataset[\"train\"][0]"
251
+ ]
252
+ },
253
+ {
254
+ "cell_type": "code",
255
+ "execution_count": null,
256
+ "id": "8f3ec639-e426-4233-a20a-dad94069175b",
257
+ "metadata": {},
258
+ "outputs": [],
259
+ "source": []
260
+ }
261
+ ],
262
+ "metadata": {
263
+ "kernelspec": {
264
+ "display_name": "Python 3 (ipykernel)",
265
+ "language": "python",
266
+ "name": "python3"
267
+ },
268
+ "language_info": {
269
+ "codemirror_mode": {
270
+ "name": "ipython",
271
+ "version": 3
272
+ },
273
+ "file_extension": ".py",
274
+ "mimetype": "text/x-python",
275
+ "name": "python",
276
+ "nbconvert_exporter": "python",
277
+ "pygments_lexer": "ipython3",
278
+ "version": "3.12.3"
279
+ }
280
+ },
281
+ "nbformat": 4,
282
+ "nbformat_minor": 5
283
+ }
01-data_env/.ipynb_checkpoints/3-dataset-use-checkpoint.ipynb ADDED
@@ -0,0 +1,256 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "68c06a52-e27c-4da6-8a02-cd010270bedf",
6
+ "metadata": {},
7
+ "source": [
8
+ "# 3 Basic usage of the datasets library"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "2dc4c70f-694c-4785-81d8-26ebab2b7210",
14
+ "metadata": {},
15
+ "source": [
16
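+ "## Basic usage\n",
17
+ "The previous section covered reading local files with datasets; this section goes on to some other commonly used datasets methods\n",
18
+ "\n",
19
+ "First comes data splitting: data such as DNA sequences arrive from the source as a single text file, but training generally needs it split into a training set, a test set and so on\n",
20
+ "\n",
21
+ "A simple example is shown below:"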
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "execution_count": 1,
27
+ "id": "6e9f346f-31f6-40cc-86e5-723c65033883",
28
+ "metadata": {},
29
+ "outputs": [
30
+ {
31
+ "data": {
32
+ "text/plain": [
33
+ "DatasetDict({\n",
34
+ " train: Dataset({\n",
35
+ " features: ['text'],\n",
36
+ " num_rows: 1025615\n",
37
+ " })\n",
38
+ " test: Dataset({\n",
39
+ " features: ['text'],\n",
40
+ " num_rows: 53980\n",
41
+ " })\n",
42
+ "})"
43
+ ]
44
+ },
45
+ "execution_count": 1,
46
+ "metadata": {},
47
+ "output_type": "execute_result"
48
+ }
49
+ ],
50
+ "source": [
51
+ "# Load the DNA data\n",
52
+ "from datasets import load_dataset\n",
53
+ "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')['train'].train_test_split(test_size=0.05) # shuffles by default\n",
54
+ "dna_dataset"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": 2,
60
+ "id": "75900650-74da-4ca9-a285-b2832a5a1485",
61
+ "metadata": {},
62
+ "outputs": [
63
+ {
64
+ "data": {
65
+ "text/plain": [
66
+ "{'text': 'ATGTGTGCAATGGGTTATCTTTATGTAATAACAGTCATATCACGGGTGTTCCTCAGAAGTAGTGAACTGGCTAGCATTTTTAGACACTATGTGATCTCTCATATGACTACACTCAATTTAAAATAAAATGAAATGTGTTGTGTGTGTCTAAAATCTATAAAGGGAAAAGTATCTTAAGTATTTTTTAGATGTTAAAGTAGATGTGTATCCTAAAATATGCATTGTTCACAGATGTTAAAATTACAACTACAATCTGTGAAACACAGATCTTAGGACAGCAATGTTTCACAAGAAAAAAAATGATGCAGCCTTCTTTAGTATTTATAGTCATTTGAACAATTATGGCAACCATAAGTTCATATATAACATCCCCATTTGGTGAAACTAGTTGGGAAAGATTAGAAGGTATGACCTTGTTGGAGGAACTATACCATTGGGGTGGCTTTGAGACTTCAGAAGTTTCAAGGCCCATTTAGTGCTTTCTACCTTATGAAGCTGTGAGTTCTCCTTGCTAGCTACATAACTTGGAAAGCAGGCCCTGCACTTCACCCAAGGAGCACATTAGAGCTGGCCCTTTTGGAAGGCAATTGCGTAAGCCACACCAGGGCACCAGAGATCTGGCACTGCCATGCTCCTGCTTGCAAGTAGTGGTGTGGGTGTTGGGTGATGCCCTCCAGTCCCACCTTTTGCCACCTGTAGTAGTCAGGGGAGTTGGCCTAAGGGCATGAGAGCCTAAGACTTCACCCTAATCCCTCACCAACTGTAGCATGTGGAAGAGCAGGCTCTGTACCTTCCCTGGGCAACACATTGGAGCTGGCCCCTCACAGGCTGCAGGACTTGGGAGAGTGAGTGCTGCACCTTGACTGTGAAGGTGGTTTTGGAGGTGTGGGTGTGAGACCATGAGACCAAGAGAGGAATGGAATATTACTCACTTATTAAAAACAATGACTTCATGAAATTTGCAGGCAAATGGATGGAACTTGAAAATATCCTGAGTGAG'}"
67
+ ]
68
+ },
69
+ "execution_count": 2,
70
+ "metadata": {},
71
+ "output_type": "execute_result"
72
+ }
73
+ ],
74
+ "source": [
75
+ "dna_dataset[\"test\"][0]"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "markdown",
80
+ "id": "cdcc5404-6590-47a4-be2c-2c1d35d3bae4",
81
+ "metadata": {},
82
+ "source": [
83
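+ "As you can see, the dataset has been split into train and test sets, and the rows were shuffled during the split\n",
84
+ "\n",
85
+ "If the dataset is too large and we only need part of it, which is a common requirement, we can generally use the Dataset.select() function"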
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "execution_count": 4,
91
+ "id": "049ad194-cb60-4b0f-8554-1915bfc7a9cd",
92
+ "metadata": {},
93
+ "outputs": [
94
+ {
95
+ "data": {
96
+ "text/plain": [
97
+ "DatasetDict({\n",
98
+ " train: Dataset({\n",
99
+ " features: ['text'],\n",
100
+ " num_rows: 50000\n",
101
+ " })\n",
102
+ " valid: Dataset({\n",
103
+ " features: ['text'],\n",
104
+ " num_rows: 500\n",
105
+ " })\n",
106
+ "})"
107
+ ]
108
+ },
109
+ "execution_count": 4,
110
+ "metadata": {},
111
+ "output_type": "execute_result"
112
+ }
113
+ ],
114
+ "source": [
115
+ "from datasets import load_dataset, DatasetDict\n",
116
+ "\n",
117
+ "dna_dataset_sample = DatasetDict(\n",
118
+ " {\n",
119
+ " \"train\": dna_dataset[\"train\"].shuffle().select(range(50000)), \n",
120
+ " \"valid\": dna_dataset[\"test\"].shuffle().select(range(500))\n",
121
+ " }\n",
122
+ ")\n",
123
+ "dna_dataset_sample"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "markdown",
128
+ "id": "50cceda3-36ca-4fa6-bfb5-1dbeb155fe4f",
129
+ "metadata": {},
130
+ "source": [
131
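+ "Here we build the datasets directly with DatasetDict: shuffle() randomizes the order, then select picks the first n rows\n",
132
+ "\n",
133
+ "select takes indices (a list or range): a list of indices or a range object specifying which samples to pick; for example, dataset.select([0, 2, 4]) selects the 1st, 3rd and 5th records"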
134
+ ]
135
+ },
136
+ {
137
+ "cell_type": "markdown",
138
+ "id": "17a1fa7c-ff4b-419f-8a82-e58cc5777cd4",
139
+ "metadata": {},
140
+ "source": [
141
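+ "## Loading from the online hub\n",
142
+ "\n",
143
+ "Data can also be read directly from Hugging Face's online repositories; in that case, make sure the hub is reachable from your network (a proxy or mirror may be needed).\n",
144
+ "\n",
145
+ "The function used is again load_dataset"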
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": 9,
151
+ "id": "6ae24950-2c74-457b-b1f2-d2e4397e1fa1",
152
+ "metadata": {},
153
+ "outputs": [
154
+ {
155
+ "data": {
156
+ "text/plain": [
157
+ "\"\\nimport os\\n\\n# Set the environment variable (AutoDL dedicated zones, other IDCs)\\nos.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\\n\\n# Print the environment variable to confirm it was set\\nprint(os.environ.get('HF_ENDPOINT'))\\n\""
158
+ ]
159
+ },
160
+ "execution_count": 9,
161
+ "metadata": {},
162
+ "output_type": "execute_result"
163
+ }
164
+ ],
165
+ "source": [
166
+ "import subprocess\n",
167
+ "import os\n",
168
+ "# Set the proxy environment variables (AutoDL general regions)\n",
169
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
170
+ "output = result.stdout\n",
171
+ "for line in output.splitlines():\n",
172
+ " if '=' in line:\n",
173
+ " var, value = line.split('=', 1)\n",
174
+ " os.environ[var] = value\n",
175
+ "\n",
176
+ "# Or\n",
177
+ "\"\"\"\n",
178
+ "import os\n",
179
+ "\n",
180
+ "# Set the environment variable (AutoDL dedicated zones, other IDCs)\n",
181
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
182
+ "\n",
183
+ "# Print the environment variable to confirm it was set\n",
184
+ "print(os.environ.get('HF_ENDPOINT'))\n",
185
+ "\"\"\""
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "code",
190
+ "execution_count": 10,
191
+ "id": "30ff9798-d06d-4992-81fc-03102f03599b",
192
+ "metadata": {},
193
+ "outputs": [
194
+ {
195
+ "data": {
196
+ "text/plain": [
197
+ "DatasetDict({\n",
198
+ " train: Dataset({\n",
199
+ " features: ['sequence', 'label'],\n",
200
+ " num_rows: 59196\n",
201
+ " })\n",
202
+ "})"
203
+ ]
204
+ },
205
+ "execution_count": 10,
206
+ "metadata": {},
207
+ "output_type": "execute_result"
208
+ }
209
+ ],
210
+ "source": [
211
+ "from datasets import load_dataset\n",
212
+ "dna_data = load_dataset(\"dnagpt/dna_core_promoter\")\n",
213
+ "dna_data"
214
+ ]
215
+ },
216
+ {
217
+ "cell_type": "markdown",
218
+ "id": "30c4b754-af11-4ac1-9742-45427059617e",
219
+ "metadata": {},
220
+ "source": [
221
+ "If you want to share your dataset on Hugging Face, a single function call is enough:"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "code",
226
+ "execution_count": null,
227
+ "id": "f9847be9-e085-41e3-ad29-a450cc017d64",
228
+ "metadata": {},
229
+ "outputs": [],
230
+ "source": [
231
+ "dna_data.push_to_hub(\"org_name/your_dataset_name\", token=\"hf_yourtoken\")"
232
+ ]
233
+ }
234
+ ],
235
+ "metadata": {
236
+ "kernelspec": {
237
+ "display_name": "Python 3 (ipykernel)",
238
+ "language": "python",
239
+ "name": "python3"
240
+ },
241
+ "language_info": {
242
+ "codemirror_mode": {
243
+ "name": "ipython",
244
+ "version": 3
245
+ },
246
+ "file_extension": ".py",
247
+ "mimetype": "text/x-python",
248
+ "name": "python",
249
+ "nbconvert_exporter": "python",
250
+ "pygments_lexer": "ipython3",
251
+ "version": "3.12.3"
252
+ }
253
+ },
254
+ "nbformat": 4,
255
+ "nbformat_minor": 5
256
+ }
01-data_env/1-env-intro.ipynb ADDED
@@ -0,0 +1,89 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "a25f3d36-e14f-4afd-8926-32748a42e1d1",
6
+ "metadata": {},
7
+ "source": [
8
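+ "# 1 Introduction to the runtime environment for large models\n",
9
+ "\n",
10
+ "\n",
11
+ "We recommend using a hosted environment such as AutoDL or Google Colab\n",
12
+ "\n",
13
+ "GPU: 4090 or 4090D\n",
14
+ "\n",
15
+ "RAM: at least 32 GB\n",
16
+ "\n",
17
+ "torch>=2.3.0\n",
18
+ "\n",
19
+ "For details, see: https://zhuanlan.zhihu.com/p/13479003076\n",
20
+ "\n",
21
+ "Installing the basic transformers environment below with pip is enough:"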
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "execution_count": null,
27
+ "id": "cdeae2e5-2a39-4370-a5ec-47780f8fa76a",
28
+ "metadata": {},
29
+ "outputs": [],
30
+ "source": [
31
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
32
+ ]
33
+ },
34
+ {
35
+ "cell_type": "markdown",
36
+ "id": "a355a6e6-62fc-4b8f-ba35-b9c2f0ef48c8",
37
+ "metadata": {},
38
+ "source": [
39
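+ "To run DeepSpeed, a single machine with multiple GPUs is usually enough; this tutorial generally does not cover cases that need multiple machines.\n",
40
+ "\n",
41
+ "\n",
42
+ "Recommended GPU hosts:\n",
43
+ "* autodl.com, based in mainland China\n",
44
+ "* vast.ai, based overseas\n",
45
+ "\n",
46
+ "GPUs on mainstream cloud platforms are generally very expensive, and those platforms do not allow running cards such as the 4090."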
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "markdown",
51
+ "id": "4ee0674c-f001-453f-b0b0-7e3b25309040",
52
+ "metadata": {},
53
+ "source": [
54
+ "We also recommend turning on annotations in Jupyter, which makes learning much easier\n",
55
+ "\n",
56
+ "<img src=\"img/zhushi.png\" alt=\"Example image\" width=\"500px\" />"
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "code",
61
+ "execution_count": null,
62
+ "id": "444adc87-78c8-4209-8260-0c5c4a668ea0",
63
+ "metadata": {},
64
+ "outputs": [],
65
+ "source": []
66
+ }
67
+ ],
68
+ "metadata": {
69
+ "kernelspec": {
70
+ "display_name": "Python 3 (ipykernel)",
71
+ "language": "python",
72
+ "name": "python3"
73
+ },
74
+ "language_info": {
75
+ "codemirror_mode": {
76
+ "name": "ipython",
77
+ "version": 3
78
+ },
79
+ "file_extension": ".py",
80
+ "mimetype": "text/x-python",
81
+ "name": "python",
82
+ "nbconvert_exporter": "python",
83
+ "pygments_lexer": "ipython3",
84
+ "version": "3.12.3"
85
+ }
86
+ },
87
+ "nbformat": 4,
88
+ "nbformat_minor": 5
89
+ }
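
After running the pip command in this notebook, it can help to confirm what actually got installed. The following is a minimal sketch using only the standard library; the package list mirrors the notebook's requirements (plus torch), and the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata

# Packages named in the notebook's requirements, plus torch.
PACKAGES = ["torch", "transformers", "sentencepiece", "protobuf",
            "deepspeed", "peft", "datasets"]


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

This only reads package metadata, so it does not check that a GPU is visible; it is meant as a quick first look before starting the later notebooks.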
01-data_env/2-data-intro.ipynb ADDED
@@ -0,0 +1,283 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "50ff8836-7075-4858-b463-c99f973f408d",
6
+ "metadata": {},
7
+ "source": [
8
+ "# 2 Gene-related pretraining and fine-tuning data"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "17cde5bb-70e5-437e-a4a3-193a881dd412",
14
+ "metadata": {},
15
+ "source": [
16
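+ "This tutorial focuses on gene-related biological sequence data, mainly DNA and protein sequences. The data directory contains the following files:\n",
17
+ "\n",
18
+ "* dna_1g.txt: DNA sequence data, 1 GB, extracted from the glue dataset (see the dnabert2 paper for details); it covers several model organisms\n",
19
+ "* protein_1g.txt: protein sequence data, 1 GB, extracted from the PDB database\n",
20
+ "* english_500m.txt: English data, 500 MB, taken from English encyclopedia text"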
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "markdown",
25
+ "id": "b45ecf27-1514-45e0-bfbd-361e6dcc98ea",
26
+ "metadata": {},
27
+ "source": [
28
+ "Below we demonstrate basic usage of Hugging Face's datasets library, along with some sample data"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": 3,
34
+ "id": "2715f9bb-2e43-4bd6-8715-5c96d317bcf8",
35
+ "metadata": {},
36
+ "outputs": [
37
+ {
38
+ "data": {
39
+ "application/vnd.jupyter.widget-view+json": {
40
+ "model_id": "c067aeb8ab304723ac6b527e7ad6c768",
41
+ "version_major": 2,
42
+ "version_minor": 0
43
+ },
44
+ "text/plain": [
45
+ "Generating train split: 0 examples [00:00, ? examples/s]"
46
+ ]
47
+ },
48
+ "metadata": {},
49
+ "output_type": "display_data"
50
+ },
51
+ {
52
+ "data": {
53
+ "text/plain": [
54
+ "DatasetDict({\n",
55
+ " train: Dataset({\n",
56
+ " features: ['text'],\n",
57
+ " num_rows: 1079595\n",
58
+ " })\n",
59
+ "})"
60
+ ]
61
+ },
62
+ "execution_count": 3,
63
+ "metadata": {},
64
+ "output_type": "execute_result"
65
+ }
66
+ ],
67
+ "source": [
68
+ "# Load the DNA data\n",
69
+ "from datasets import load_dataset\n",
70
+ "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')\n",
71
+ "dna_dataset"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "markdown",
76
+ "id": "ec00ad72-c5f9-40db-8508-6c6bf8f374c1",
77
+ "metadata": {},
78
+ "source": [
79
+ "\n",
80
+ "Datasets provides loading scripts for local and remote datasets. It supports several common data formats, for example:\n",
81
+ "\n",
82
+ "| Data format | Loading script | Example |\n",
83
+ "|-------------------|----------------|-------------------------------------------------------------------------|\n",
84
+ "| CSV & TSV | csv | `load_dataset(\"csv\", data_files=\"my_file.csv\")` |\n",
85
+ "| Text files | text | `load_dataset(\"text\", data_files=\"my_file.txt\")` |\n",
86
+ "| JSON & JSON Lines | json | `load_dataset(\"json\", data_files=\"my_file.jsonl\")` |\n",
87
+ "| Pickled DataFrames| pandas | `load_dataset(\"pandas\", data_files=\"my_dataframe.pkl\")` |\n",
88
+ "\n",
89
+ "As the table shows, for each data format we only need to call load_dataset() and use the data_files argument to give the path to one or more files."
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "markdown",
94
+ "id": "24c40ec7-cb59-4c3a-8052-00d7979f6208",
95
+ "metadata": {},
96
+ "source": [
97
+ "By default load_dataset places the data under the train split; the returned dataset can be used like an ordinary Python dict"
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "code",
102
+ "execution_count": 4,
103
+ "id": "2a375409-d2b6-4648-8f6a-8ac3fb25bb75",
104
+ "metadata": {},
105
+ "outputs": [
106
+ {
107
+ "data": {
108
+ "text/plain": [
109
+ "{'text': 'TTAAATCCTAGAAGTTGGTTACACGGGTGAGGAAAATGGTGAGAAGCCCAATGGGATGCTGTAGCAATGACAGTGAACTGCTGTCACCCCTGAGGCTGGAAAGATAACAGACATTTGCCAGGAGCTAGAAGCTGGGGCAGCCTGGTAGGAGCGAGAATATGGTGAGAGCTGCCCCCTGGGGATGGAACCACAGAGGGAGGGTCTCTCTGATGAGACATAGAGCCAAGAACAGATACAGCCATTGTGGGAGATGGTAACCAAAGCAGAGAGAGAGAGAGAGAGCGAGAGAGAGAGAAAACACCCTGGTTTCTTCCTTCCTTCCACCTTTGAGTTTCCCACCAGTGCTTCCCATTAGCCCAAACTACCAAGAACCCAGAGGGCAAAGGAGCCCGGGAAATCTAATTCTACATGATACCGAGCAAAGCCGATGTTCCAGCTGGCTGCGTCTGTTACAGTAGGTAGTCAGGCAGACATAAGCAGGGCAGGAGAGGGCTCCTCCCAACCAGGAATGTCAGGTGACGGTCAGGTGATGGTCAGGTGGTCATTAACTGTCTCTCTAAAATAATAATTGGTTACAGCCAGCACCAGGGAAAGGCAGTCTCCCAACCGATAGAAACATCTGAAACTGATGATCAGTAGCTTCCCAATAAGGTCTCAGGAGTTGGACGCATGGGCTCAGCATGAACACTGAGAGGCAAAATGGTGGAGTTTAACTGGTATATGACCTTCCTCTAGAAACATTCAGCTGGTAAGGGAAGAACGCCTTAAGCGAATATGCACGCAACTCCAGTAAACACTGTGCATGTTCCTGTCCCAATGCTGGTAGACCACTGCGCATGCAAACAGCCCACCCCAGGGAAGAATCAGGAGAGAAGAGACCCCACAAGCATGCCAACACATAAAACCCCAAGTCAGGAGTCAAACCATGCACTTGAATCAAGTCACCCACTTAGCTCTCTTTCAAGTGTATTTTACTTTCTTTCATTCCTGCTCTAAAACT'}"
110
+ ]
111
+ },
112
+ "execution_count": 4,
113
+ "metadata": {},
114
+ "output_type": "execute_result"
115
+ }
116
+ ],
117
+ "source": [
118
+ "dna_dataset[\"train\"][0]"
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "raw",
123
+ "id": "985bd82a-1ff0-49ef-968d-8d5f6df8d76f",
124
+ "metadata": {},
125
+ "source": [
126
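+ "As shown above, DNA data is text made up of the four letters A, T, C and G. For the purpose of learning large language models you can ignore what it actually means; in any case, the meaning of most DNA sequences has not been deciphered yet :)\n",
127
+ "\n",
128
+ "Next comes the protein sequence"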
129
+ ]
130
+ },
131
+ {
132
+ "cell_type": "code",
133
+ "execution_count": 5,
134
+ "id": "94e3f443-939e-4148-bba6-6cafa90790b6",
135
+ "metadata": {},
136
+ "outputs": [
137
+ {
138
+ "data": {
139
+ "application/vnd.jupyter.widget-view+json": {
140
+ "model_id": "a1023bd5311a4a5dbe96c6c8fdc5b519",
141
+ "version_major": 2,
142
+ "version_minor": 0
143
+ },
144
+ "text/plain": [
145
+ "Generating train split: 0 examples [00:00, ? examples/s]"
146
+ ]
147
+ },
148
+ "metadata": {},
149
+ "output_type": "display_data"
150
+ },
151
+ {
152
+ "data": {
153
+ "text/plain": [
154
+ "{'text': 'MLTDPFGRTIKLRIAVTRCLCIYCHREGESDPGTEMSAERIAEIAKAFYELGIKKLKLTGGEPLLRKDICEIISMMPDFEEISLTTGILLSDLAFDLKESGLDRVISLDTLDAETFRFITGGGELSRVLEGLRMAVEAKLTPIKLMVLMSGLESEVRKMLEFASFEETVILQLIELIPSRTGKFYLDPTIFEKDFERVAKAVKIRDMHRRKQFITPFGVVEIVKPLDTEFCMHCRIRITSDGRIKLCLMSDETVDISELSGDELKKAIFEAVKRRKPFFIMKGEILALISAVLWGFAPILDRYALLSGAPIYAALAIRAFGALIAMLFILSVLRGGLAVEAKAAVLLLIAGAIGGALAMVFYYLALESVGASRTVPITAIYPMFTALFSFLLLSEPLSPKTIAGIAFIVLGVILVSEGMVKLRGEDVVIRKYDHSMDRDKLIEMYVYDPRFRCLGLPPLSKEAIKGWIDYLGQGFAIIAEKDGKIVGHLVIVPGEREVDLTIFIHQDYQLGLGQEMMKLIIDFCRKAGFAITLVTERTARAIHVYRKLGFEIVAPYYEYDMRLQLKMIVPKGKTVLIKGTASIRGECEVLGARLFFESEKFVPVFCLEDCEIEVGEFKILDGSTIPESWEKLSKMDWETVFLYGGVDSGKSTLATYLAKVGGAYVLDLDIGQADVAPGAMGYGFAKDVVSLSKVSMIGFFVGSITPQGREAKCLRGVARLWKELRKLDGRKIIDTTGWVRGRRAKEYKLAKLEIIEPDLIASFEGKLFDWKTFEVEKGYVIRRDKDRAKARFESYRKFLDGAKTFELERDGIKLKPDFFKGKDVSQFIESVLGTRVVFARLGEEHLTICTKEDCPEYEILRELKELYEVDDIFLFSESEARFVAGLYRGKKYLGIGLIKSIDRILLECTQSDFDTIEIGEIRLEDGRECFIKRFMARIAYSYKPQDETRAARAMGYEVPISFKHAMEICRVLKGKKVPQAISFLEEVVQLKVPVPFRKHKKKVAHKIPGWYAGRYPQKAAEILKVLKLKAAEYKGLKAEELIIVHAQAKK'}"
155
+ ]
156
+ },
157
+ "execution_count": 5,
158
+ "metadata": {},
159
+ "output_type": "execute_result"
160
+ }
161
+ ],
162
+ "source": [
163
+ "protein_dataset = load_dataset('text', data_files='data/protein_1g.txt')\n",
164
+ "protein_dataset[\"train\"][0]"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "markdown",
169
+ "id": "ecaa8216-7b9f-4ba0-af8e-c7c868dc7ec9",
170
+ "metadata": {},
171
+ "source": [
172
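+ "A protein sequence is text made up of 20 letters (amino acids) such as M, L, T, D and P. Proteins are currently understood far better than DNA.\n",
173
+ "\n",
174
+ "Then comes the English text, which is easier to read"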
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "code",
179
+ "execution_count": 9,
180
+ "id": "7521f8ea-fd70-4f5b-aeeb-7ff81635320d",
181
+ "metadata": {},
182
+ "outputs": [
183
+ {
184
+ "data": {
185
+ "text/plain": [
186
+ "{'text': ' \" There \\'s Got to Be a Way \" is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as the fifth and final single from the album in the United Kingdom . It was one of four songs Carey wrote with Ric Wake during their first recording session together , but \" There \\'s Got to Be a Way \" was the only composition to make the final track listing . It is a socio @-@ political conscious R & B @-@ pop song which addresses the existence of poverty , racism and war in the world which gradually becomes more aspirational and positive as it progresses . The track garnered a mixed reception upon the album \\'s release in 1990 . While Carey \\'s vocals were praised , it was seen as too political . An accompanying music video highlights social injustices . The song reached number 54 on the UK Singles Chart . '}"
187
+ ]
188
+ },
189
+ "execution_count": 9,
190
+ "metadata": {},
191
+ "output_type": "execute_result"
192
+ }
193
+ ],
194
+ "source": [
195
+ "english_dataset = load_dataset('text', data_files='data/english_500m.txt')\n",
196
+ "english_dataset[\"train\"][301]"
197
+ ]
198
+ },
199
+ {
200
+ "cell_type": "markdown",
201
+ "id": "5fcad08d-6389-453e-997f-eb2877a5fbbb",
202
+ "metadata": {},
203
+ "source": [
204
+ "English text is made up of 26 letters. Note that English includes spaces, while biological sequences have no explicit spaces"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "markdown",
209
+ "id": "5e4e1e85-a187-469d-9950-1c6cbb9c41f7",
210
+ "metadata": {},
211
+ "source": [
212
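+ "The datasets above are plain text and are generally used as pretraining data. Downstream fine-tuning tasks such as classification usually include labels and are mostly stored as JSON or CSV; here is an example:"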
213
+ ]
214
+ },
215
+ {
216
+ "cell_type": "code",
217
+ "execution_count": 11,
218
+ "id": "c48dd04e-af42-4222-94d5-56a8e08e2cbf",
219
+ "metadata": {},
220
+ "outputs": [
221
+ {
222
+ "data": {
223
+ "application/vnd.jupyter.widget-view+json": {
224
+ "model_id": "7c611d1ab3bb408394196e7929d8e0c5",
225
+ "version_major": 2,
226
+ "version_minor": 0
227
+ },
228
+ "text/plain": [
229
+ "Generating train split: 0 examples [00:00, ? examples/s]"
230
+ ]
231
+ },
232
+ "metadata": {},
233
+ "output_type": "display_data"
234
+ },
235
+ {
236
+ "data": {
237
+ "text/plain": [
238
+ "{'sentence1': 'ATGGAGGAAAATCAGACCATGGTCACAGAGTTCGTCCTGCTGGGATTCTGTCTTGGCCCGAGGATTCACCTAGTTCTTTTTCTGCTTTTCTCTCTCTTCTATACTCTCACCATACTGGGGAATGGGACTATCCTTGCAATGATCTGCCTGGACTCCAGACTCCACACTCCCATGTACTTCTTCCTGTCCCACCTGGCCATTGTCGATATGGCCTATGCCTGCAACACAGTGCCTCAGACACTCATAAACCTCTTGGATGAGACCAGGCCCATCACCTTTGCTGGATGCATGACACAGACCTTTCTCTTCTTGGCTTTTGCCCACACTGAATGTGTGCTCCTTGTTGTGATGTCCTATGACCGGTATGTAGCTATCTGCCACCCGCTACACTACACTGTCATCATGAACTGGAGAGTGTGTACCATTCTGGCTGCTGTTTCCTGGATATTTAGCTTTCTCCTTGCTCTGGTCCATTTAGTTCTCATCCTGAGGCTGCCCTTCTGTGGACCTCATGAAATCAATCACTTCTTCTGTGAAATCCTGTCTGTCCTCAAGCTGGCCTGTGCTGACACAACACTCAATCAGGTCGTTATCTTTGCAGCTTGTGTGTTCATATTAGTGGCCCCCCTATGCTTTGTACTAGTCTCCTACACACGCATCCTGGTGGCCATCCTGAGGATCCAGTCAGGGGAGGGACGCAGAAAGGCCTTCTCTACCTGTTCCTCCCACCTCTGTGTGGTAGGGCTCTTCTTTGGCAGTGCCATTGTCATGTACATGGCCCCCAAGTCCCAGCACCCAGAGGAGCAGCAGAAGGTTCTTTTCCTGTTTTACAGTTTTTTCAACCCCATGCTGAACCCCCTAATCTACAGTCTGAGGAATGCTGAGGTGAAGGGCGCCCTCAAGAGGTCACTGTGCAAAGAAAGTCATTCCTGGTTGGTGTGGTGTTCGGACCATAAATCTTGG',\n",
239
+ " 'sentence2': 'MEENQTMVTEFVLLGFCLGPRIHLVLFLLFSLFYTLTILGNGTILAMICLDSRLHTPMYFFLSHLAIVDMAYACNTVPQTLINLLDETRPITFAGCMTQTFLFLAFAHTECVLLVVMSYDRYVAICHPLHYTVIMNWRVCTILAAVSWIFSFLLALVHLVLILRLPFCGPHEINHFFCEILSVLKLACADTTLNQVVIFAACVFILVAPLCFVLVSYTRILVAILRIQSGEGRRKAFSTCSSHLCVVGLFFGSAIVMYMAPKSQHPEEQQKVLFLFYSFFNPMLNPLIYSLRNAEVKGALKRSLCKESHSWLVWCSDHKSW',\n",
240
+ " 'label': 1}"
241
+ ]
242
+ },
243
+ "execution_count": 11,
244
+ "metadata": {},
245
+ "output_type": "execute_result"
246
+ }
247
+ ],
248
+ "source": [
249
+ "ft_dataset = load_dataset('json', data_files='data/dna_protein_my.json')\n",
250
+ "ft_dataset[\"train\"][0]"
251
+ ]
252
+ },
253
+ {
254
+ "cell_type": "code",
255
+ "execution_count": null,
256
+ "id": "8f3ec639-e426-4233-a20a-dad94069175b",
257
+ "metadata": {},
258
+ "outputs": [],
259
+ "source": []
260
+ }
261
+ ],
262
+ "metadata": {
263
+ "kernelspec": {
264
+ "display_name": "Python 3 (ipykernel)",
265
+ "language": "python",
266
+ "name": "python3"
267
+ },
268
+ "language_info": {
269
+ "codemirror_mode": {
270
+ "name": "ipython",
271
+ "version": 3
272
+ },
273
+ "file_extension": ".py",
274
+ "mimetype": "text/x-python",
275
+ "name": "python",
276
+ "nbconvert_exporter": "python",
277
+ "pygments_lexer": "ipython3",
278
+ "version": "3.12.3"
279
+ }
280
+ },
281
+ "nbformat": 4,
282
+ "nbformat_minor": 5
283
+ }
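The labelled record shown in the notebook above has three fields: `sentence1` (a DNA sequence), `sentence2` (a protein sequence) and an integer `label`. As a minimal sketch of how a file with that shape could be written and read back, the following uses only Python's standard `json` module and stores one JSON object per line (JSON Lines). The file name `toy_pairs.jsonl`, the helper names, the toy sequences and their labels are all hypothetical. The actual layout of `data/dna_protein_my.json` is not rendered in this commit, so the one-record-per-line layout is an assumption.

```python
import json
import os
import tempfile


def write_records(path, records):
    """Write one JSON object per line (JSON Lines layout, assumed here)."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")


def read_records(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Toy records with the same field names as the notebook output.
# The sequences are shortened and the labels are made up for illustration.
toy_records = [
    {"sentence1": "ATGGAGGAA", "sentence2": "MEE", "label": 1},
    {"sentence1": "ATGTGTGCA", "sentence2": "MEE", "label": 0},
]

toy_path = os.path.join(tempfile.mkdtemp(), "toy_pairs.jsonl")
write_records(toy_path, toy_records)
loaded = read_records(toy_path)
print(len(loaded), loaded[0]["label"])
```

A file in this layout can then be passed to `load_dataset('json', data_files=...)` in the same way as the notebook cell does.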
01-data_env/3-dataset-use.ipynb ADDED
@@ -0,0 +1,256 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "68c06a52-e27c-4da6-8a02-cd010270bedf",
6
+ "metadata": {},
7
+ "source": [
8
+ "# 3 Basic usage of the datasets library"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "2dc4c70f-694c-4785-81d8-26ebab2b7210",
14
+ "metadata": {},
15
+ "source": [
16
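+ "## Basic usage\n",
17
+ "The previous section showed how to read local files with datasets. This section covers some other commonly used methods.\n",
18
+ "\n",
19
+ "First, data splitting. The DNA sequences and similar data we obtain from a source come as a single text file, but training generally requires splitting it into training and test sets.\n",
20
+ "\n",
21
+ "A simple example:"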
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "execution_count": 1,
27
+ "id": "6e9f346f-31f6-40cc-86e5-723c65033883",
28
+ "metadata": {},
29
+ "outputs": [
30
+ {
31
+ "data": {
32
+ "text/plain": [
33
+ "DatasetDict({\n",
34
+ " train: Dataset({\n",
35
+ " features: ['text'],\n",
36
+ " num_rows: 1025615\n",
37
+ " })\n",
38
+ " test: Dataset({\n",
39
+ " features: ['text'],\n",
40
+ " num_rows: 53980\n",
41
+ " })\n",
42
+ "})"
43
+ ]
44
+ },
45
+ "execution_count": 1,
46
+ "metadata": {},
47
+ "output_type": "execute_result"
48
+ }
49
+ ],
50
+ "source": [
51
+ "# Load the DNA data\n",
52
+ "from datasets import load_dataset\n",
53
+ "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')['train'].train_test_split(test_size=0.05)  # shuffles by default\n",
54
+ "dna_dataset"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": 2,
60
+ "id": "75900650-74da-4ca9-a285-b2832a5a1485",
61
+ "metadata": {},
62
+ "outputs": [
63
+ {
64
+ "data": {
65
+ "text/plain": [
66
+ "{'text': 'ATGTGTGCAATGGGTTATCTTTATGTAATAACAGTCATATCACGGGTGTTCCTCAGAAGTAGTGAACTGGCTAGCATTTTTAGACACTATGTGATCTCTCATATGACTACACTCAATTTAAAATAAAATGAAATGTGTTGTGTGTGTCTAAAATCTATAAAGGGAAAAGTATCTTAAGTATTTTTTAGATGTTAAAGTAGATGTGTATCCTAAAATATGCATTGTTCACAGATGTTAAAATTACAACTACAATCTGTGAAACACAGATCTTAGGACAGCAATGTTTCACAAGAAAAAAAATGATGCAGCCTTCTTTAGTATTTATAGTCATTTGAACAATTATGGCAACCATAAGTTCATATATAACATCCCCATTTGGTGAAACTAGTTGGGAAAGATTAGAAGGTATGACCTTGTTGGAGGAACTATACCATTGGGGTGGCTTTGAGACTTCAGAAGTTTCAAGGCCCATTTAGTGCTTTCTACCTTATGAAGCTGTGAGTTCTCCTTGCTAGCTACATAACTTGGAAAGCAGGCCCTGCACTTCACCCAAGGAGCACATTAGAGCTGGCCCTTTTGGAAGGCAATTGCGTAAGCCACACCAGGGCACCAGAGATCTGGCACTGCCATGCTCCTGCTTGCAAGTAGTGGTGTGGGTGTTGGGTGATGCCCTCCAGTCCCACCTTTTGCCACCTGTAGTAGTCAGGGGAGTTGGCCTAAGGGCATGAGAGCCTAAGACTTCACCCTAATCCCTCACCAACTGTAGCATGTGGAAGAGCAGGCTCTGTACCTTCCCTGGGCAACACATTGGAGCTGGCCCCTCACAGGCTGCAGGACTTGGGAGAGTGAGTGCTGCACCTTGACTGTGAAGGTGGTTTTGGAGGTGTGGGTGTGAGACCATGAGACCAAGAGAGGAATGGAATATTACTCACTTATTAAAAACAATGACTTCATGAAATTTGCAGGCAAATGGATGGAACTTGAAAATATCCTGAGTGAG'}"
67
+ ]
68
+ },
69
+ "execution_count": 2,
70
+ "metadata": {},
71
+ "output_type": "execute_result"
72
+ }
73
+ ],
74
+ "source": [
75
+ "dna_dataset[\"test\"][0]"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "markdown",
80
+ "id": "cdcc5404-6590-47a4-be2c-2c1d35d3bae4",
81
+ "metadata": {},
82
+ "source": [
83
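+ "As shown, the dataset has been split into train and test sets, and the rows were shuffled during the split.\n",
84
+ "\n",
85
+ "If the dataset is too large and we only need part of it, which is a common requirement, we can use the Dataset.select() function."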
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "execution_count": 4,
91
+ "id": "049ad194-cb60-4b0f-8554-1915bfc7a9cd",
92
+ "metadata": {},
93
+ "outputs": [
94
+ {
95
+ "data": {
96
+ "text/plain": [
97
+ "DatasetDict({\n",
98
+ " train: Dataset({\n",
99
+ " features: ['text'],\n",
100
+ " num_rows: 50000\n",
101
+ " })\n",
102
+ " valid: Dataset({\n",
103
+ " features: ['text'],\n",
104
+ " num_rows: 500\n",
105
+ " })\n",
106
+ "})"
107
+ ]
108
+ },
109
+ "execution_count": 4,
110
+ "metadata": {},
111
+ "output_type": "execute_result"
112
+ }
113
+ ],
114
+ "source": [
115
+ "from datasets import load_dataset, DatasetDict\n",
116
+ "\n",
117
+ "dna_dataset_sample = DatasetDict(\n",
118
+ " {\n",
119
+ " \"train\": dna_dataset[\"train\"].shuffle().select(range(50000)), \n",
120
+ " \"valid\": dna_dataset[\"test\"].shuffle().select(range(500))\n",
121
+ " }\n",
122
+ ")\n",
123
+ "dna_dataset_sample"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "markdown",
128
+ "id": "50cceda3-36ca-4fa6-bfb5-1dbeb155fe4f",
129
+ "metadata": {},
130
+ "source": [
131
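+ "Here we build the datasets directly with DatasetDict: shuffle() randomizes the row order, then select() takes the first n rows.\n",
132
+ "\n",
133
+ "select() takes indices (a list or range): a list of indices or a range object specifying which samples to pick. For example, dataset.select([0, 2, 4]) picks the 1st, 3rd and 5th records."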
134
+ ]
135
+ },
136
+ {
137
+ "cell_type": "markdown",
138
+ "id": "17a1fa7c-ff4b-419f-8a82-e58cc5777cd4",
139
+ "metadata": {},
140
+ "source": [
141
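+ "## Loading from the online hub\n",
142
+ "\n",
143
+ "Data can also be loaded directly from Hugging Face's online repositories. In that case, make sure your network can reach the Hub (a proxy or mirror may be needed).\n",
144
+ "\n",
145
+ "The function used is again load_dataset."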
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": 9,
151
+ "id": "6ae24950-2c74-457b-b1f2-d2e4397e1fa1",
152
+ "metadata": {},
153
+ "outputs": [
154
+ {
155
+ "data": {
156
+ "text/plain": [
157
+ "\"\\nimport os\\n\\n# Set the environment variable (AutoDL dedicated zone / other IDCs)\\nos.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\\n\\n# Print the environment variable to confirm it was set\\nprint(os.environ.get('HF_ENDPOINT'))\\n\""
158
+ ]
159
+ },
160
+ "execution_count": 9,
161
+ "metadata": {},
162
+ "output_type": "execute_result"
163
+ }
164
+ ],
165
+ "source": [
166
+ "import subprocess\n",
167
+ "import os\n",
168
+ "# Set proxy environment variables (AutoDL general region)\n",
169
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
170
+ "output = result.stdout\n",
171
+ "for line in output.splitlines():\n",
172
+ "    if '=' in line:\n",
173
+ "        var, value = line.split('=', 1)\n",
174
+ "        os.environ[var] = value\n",
175
+ "\n",
176
+ "# Or\n",
177
+ "\"\"\"\n",
178
+ "import os\n",
179
+ "\n",
180
+ "# Set the environment variable (AutoDL dedicated zone / other IDCs)\n",
181
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
182
+ "\n",
183
+ "# Print the environment variable to confirm it was set\n",
184
+ "print(os.environ.get('HF_ENDPOINT'))\n",
185
+ "\"\"\""
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "code",
190
+ "execution_count": 10,
191
+ "id": "30ff9798-d06d-4992-81fc-03102f03599b",
192
+ "metadata": {},
193
+ "outputs": [
194
+ {
195
+ "data": {
196
+ "text/plain": [
197
+ "DatasetDict({\n",
198
+ " train: Dataset({\n",
199
+ " features: ['sequence', 'label'],\n",
200
+ " num_rows: 59196\n",
201
+ " })\n",
202
+ "})"
203
+ ]
204
+ },
205
+ "execution_count": 10,
206
+ "metadata": {},
207
+ "output_type": "execute_result"
208
+ }
209
+ ],
210
+ "source": [
211
+ "from datasets import load_dataset\n",
212
+ "dna_data = load_dataset(\"dnagpt/dna_core_promoter\")\n",
213
+ "dna_data"
214
+ ]
215
+ },
216
+ {
217
+ "cell_type": "markdown",
218
+ "id": "30c4b754-af11-4ac1-9742-45427059617e",
219
+ "metadata": {},
220
+ "source": [
221
+ "If you want to share your dataset on Hugging Face, a single function call is enough:"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "code",
226
+ "execution_count": null,
227
+ "id": "f9847be9-e085-41e3-ad29-a450cc017d64",
228
+ "metadata": {},
229
+ "outputs": [],
230
+ "source": [
231
+ "dna_data.push_to_hub(\"org_name/your_dataset_name\", token=\"hf_yourtoken\")"
232
+ ]
233
+ }
234
+ ],
235
+ "metadata": {
236
+ "kernelspec": {
237
+ "display_name": "Python 3 (ipykernel)",
238
+ "language": "python",
239
+ "name": "python3"
240
+ },
241
+ "language_info": {
242
+ "codemirror_mode": {
243
+ "name": "ipython",
244
+ "version": 3
245
+ },
246
+ "file_extension": ".py",
247
+ "mimetype": "text/x-python",
248
+ "name": "python",
249
+ "nbconvert_exporter": "python",
250
+ "pygments_lexer": "ipython3",
251
+ "version": "3.12.3"
252
+ }
253
+ },
254
+ "nbformat": 4,
255
+ "nbformat_minor": 5
256
+ }
01-data_env/data/dna_1g.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32d950f86ccdb368f4652795117d23898dbccfce5a18a0ee84f78aebc43e8742
3
+ size 1080669660
01-data_env/data/dna_protein_my.json ADDED
The diff for this file is too large to render. See raw diff
 
01-data_env/data/english_500m.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:085ebb9d461cae266410953bcd2d07d9a08d50cd93db24d5c3e15d38275cd8cd
3
+ size 541727453
01-data_env/data/protein_1g.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a1c361441538520a5501605fa483970b80d72b5dbb28dbe5276890c8632ba1d4
3
+ size 1059750637
01-data_env/img/zhushi.png ADDED
02-gpt2_bert/.ipynb_checkpoints/env_ini-checkpoint.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zone / other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set proxy environment variables (AutoDL general region)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ "    if '=' in line:\n",
178
+ "        var, value = line.split('=', 1)\n",
179
+ "        os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# load ~59k samples from the core promoter prediction dataset\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
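The proxy cell in the notebook above reads the output of `env | grep proxy` and copies each `NAME=value` line into `os.environ`. The parsing step can be isolated and checked without AutoDL's `/etc/network_turbo` script. The sketch below is only an illustration under that assumption: the helper name `parse_env_lines` and the sample proxy address are made up and do not come from the notebook.

```python
def parse_env_lines(text):
    """Parse NAME=value lines into a dict, skipping lines without '='."""
    env = {}
    for line in text.splitlines():
        if "=" in line:
            # Split on the first '=' only, so values that contain '='
            # (for example URLs with query strings) stay intact.
            name, value = line.split("=", 1)
            env[name] = value
    return env


# Hypothetical output of `env | grep proxy`; the address is invented.
sample_output = (
    "http_proxy=http://10.0.0.1:8080\n"
    "https_proxy=http://10.0.0.1:8080/?a=b\n"
    "not an assignment\n"
)

parsed = parse_env_lines(sample_output)
print(parsed)
```

Passing the result to `os.environ.update(parsed)` would have the same effect as the assignment loop in the notebook cell.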
02-gpt2_bert/env_ini.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zone, other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set the environment variables (AutoDL general region)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ " if '=' in line:\n",
178
+ " var, value = line.split('=', 1)\n",
179
+ " os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# load the core promoter prediction dataset (~59k samples) and hold out 10% as a test split\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
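Note on the proxy cell in `02-gpt2_bert/env_ini.ipynb`: the cell sources `/etc/network_turbo` and copies every `NAME=value` line that mentions `proxy` into `os.environ`. The following is a minimal sketch of the same idea as a reusable helper, not the author's code. The function names `parse_env_lines` and `load_proxy_env` are hypothetical, and the early return when the script is missing is an added assumption so the cell does not fail outside AutoDL.

```python
import os
import subprocess


def parse_env_lines(text):
    """Parse `NAME=value` lines (as printed by `env`) into a dict.

    Lines without '=' are skipped; only the first '=' splits name from value.
    """
    env = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            env[name] = value
    return env


def load_proxy_env(script="/etc/network_turbo"):
    """Source `script` in bash and copy its proxy variables into os.environ.

    Returns the dict of variables that were applied. The result is empty
    when the script does not exist (assumption: we are not on AutoDL).
    """
    if not os.path.exists(script):
        return {}
    result = subprocess.run(
        ["bash", "-c", f'source "{script}" && env | grep proxy'],
        capture_output=True,
        text=True,
    )
    proxies = parse_env_lines(result.stdout)
    os.environ.update(proxies)
    return proxies
```

Calling `load_proxy_env()` once at the top of a notebook would then replace the inline loop. Passing the command as an argument list avoids `shell=True`, which the original cell uses together with a nested `bash -c`.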
03-gene-task/.ipynb_checkpoints/env_ini-checkpoint.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zone and other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set environment variables (AutoDL general region)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ "    if '=' in line:\n",
178
+ "        var, value = line.split('=', 1)\n",
179
+ "        os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# Load the core promoter prediction dataset (~59k samples) and hold out 10% for testing\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
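The proxy cell in the notebook above sources `/etc/network_turbo` and copies every `KEY=VALUE` line that `env | grep proxy` prints into `os.environ`. A minimal sketch of the same idea follows, with the parsing separated out so it can be checked without the AutoDL script. The helper names `parse_env_lines` and `load_proxy_env` are hypothetical and do not appear in the notebook, and returning an empty dict when the script is missing is an added assumption, not the notebook's behavior.

```python
import os
import subprocess


def parse_env_lines(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (as printed by `env`) into a dict."""
    parsed = {}
    for line in text.splitlines():
        if "=" in line:
            # Split on the first '=' only, so values may contain '='.
            key, value = line.split("=", 1)
            parsed[key] = value
    return parsed


def load_proxy_env(script: str = "/etc/network_turbo") -> dict[str, str]:
    """Source `script` in bash and copy its proxy variables into os.environ.

    Returns an empty dict when the script does not exist (assumed behavior).
    """
    if not os.path.exists(script):
        return {}
    result = subprocess.run(
        ["bash", "-c", f"source {script} && env | grep proxy"],
        capture_output=True,
        text=True,
        check=False,  # grep exits non-zero when nothing matches
    )
    proxies = parse_env_lines(result.stdout)
    os.environ.update(proxies)
    return proxies
```

Calling `load_proxy_env()` on a machine without the script leaves the environment untouched, which makes the cell safe to run outside AutoDL.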
03-gene-task/env_ini.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zone and other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set environment variables (AutoDL general region)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ "    if '=' in line:\n",
178
+ "        var, value = line.split('=', 1)\n",
179
+ "        os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# Load the core promoter prediction dataset (~59k samples) and hold out 10% for testing\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
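The `DatasetDict` output in the notebook above reports 53276 training rows and 5920 test rows after `train_test_split(test_size=0.1)`. The short sketch below reproduces that arithmetic. It assumes the scikit-learn-style rounding rule (test rows rounded up, the remainder used for training); that rule is an assumption about the library's behavior, not something the notebook states, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test split is rounded up and training takes the rest.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


# Total row count taken from the notebook output: 53276 + 5920.
total_rows = 53276 + 5920
print(split_sizes(total_rows, 0.1))  # (53276, 5920)
```

The total of 59196 rows is also why the loading comment is better read as roughly 59k samples than 11k.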
04-gene-sft/.ipynb_checkpoints/env_ini-checkpoint.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zone / other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set the environment variables (AutoDL general regions)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ " if '=' in line:\n",
178
+ " var, value = line.split('=', 1)\n",
179
+ " os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# load ~59k samples from the core promoter prediction dataset\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
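The proxy cell in the notebook above sources `/etc/network_turbo`, reads the resulting `*proxy*` variables from `env`, and copies them into `os.environ`. A minimal sketch of the same idea follows, with the parsing pulled out into a function so it can be checked without a network or the AutoDL script. The helper names `parse_env_lines` and `load_proxy_env` are hypothetical and do not appear in the notebook; the script path is the one the notebook uses.

```python
import os
import subprocess


def parse_env_lines(text):
    """Turn KEY=value lines (as printed by `env`) into a dict.

    Hypothetical helper; lines without "=" are skipped.
    """
    env = {}
    for line in text.splitlines():
        if "=" in line:
            # Split on the first "=" only, since values may contain "=" too.
            var, value = line.split("=", 1)
            env[var] = value
    return env


def load_proxy_env(script="/etc/network_turbo"):
    """Source `script` in bash and copy its proxy variables into os.environ.

    Hypothetical helper; returns an empty dict if the script is missing
    or sets no variable whose line contains "proxy".
    """
    result = subprocess.run(
        ["bash", "-c", f"source {script} && env | grep proxy"],
        capture_output=True,
        text=True,
    )
    proxies = parse_env_lines(result.stdout)
    os.environ.update(proxies)
    return proxies
```

Calling `load_proxy_env()` once before `load_dataset(...)` should have the same effect as the notebook cell. Passing the command as a list avoids the extra `shell=True` layer, which is a design preference rather than a requirement.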
04-gene-sft/env_ini.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zone or other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set the proxy environment variables (AutoDL general region)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ " if '=' in line:\n",
178
+ " var, value = line.split('=', 1)\n",
179
+ " os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# load the core promoter prediction dataset (~59k samples) and hold out 10% for testing\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
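
The proxy cell in the notebook above copies `KEY=VALUE` lines from the output of `env | grep proxy` into `os.environ`. A minimal sketch of that parsing step as a reusable function follows; the function name `parse_env_lines` and the sample proxy address are hypothetical and only illustrate the idea, they are not part of the original notebook.

```python
def parse_env_lines(output: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (as printed by `env`) into a dict.

    Lines without '=' are skipped. Only the first '=' splits the line,
    so values that themselves contain '=' are kept intact.
    """
    parsed: dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        var, value = line.split("=", 1)
        parsed[var] = value
    return parsed


# Hypothetical sample output; real values come from /etc/network_turbo.
sample = (
    "http_proxy=http://127.0.0.1:7890\n"
    "https_proxy=http://127.0.0.1:7890\n"
    "not a variable\n"
    "no_proxy=localhost,a=b\n"
)
proxies = parse_env_lines(sample)
print(proxies)
```

To apply the result the way the notebook does, one could call `os.environ.update(proxies)` after importing `os`.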
05-sota-model/.ipynb_checkpoints/env_ini-checkpoint.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zone or other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set the proxy environment variables (AutoDL general region)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ " if '=' in line:\n",
178
+ " var, value = line.split('=', 1)\n",
179
+ " os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# load the core promoter prediction dataset (~59k samples) and hold out 10% for testing\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
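
The notebook above splits the dataset with `test_size=0.1` and reports 53276 training rows and 5920 test rows. A small sketch of the arithmetic behind those numbers follows. It assumes the test share is rounded up and the remainder goes to training, which agrees with the row counts shown but is not confirmed here as the library's exact rule; `split_sizes` is a hypothetical helper, not a `datasets` function.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test split.

    Assumption: the test count is rounded up and the training set
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Total rows = train + test rows reported in the notebook output.
total_rows = 53276 + 5920
n_train, n_test = split_sizes(total_rows, 0.1)
print(total_rows, n_train, n_test)
```

Under that assumption the helper gives the same 53276 / 5920 split for the 59196 rows, which is also why the dataset is closer to 59k samples than 11k.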
05-sota-model/env_ini.ipynb ADDED
@@ -0,0 +1,267 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 5,
6
+ "id": "e3fbdac5-cd38-4e41-b5d2-d9d112b4ac1b",
7
+ "metadata": {
8
+ "scrolled": true
9
+ },
10
+ "outputs": [
11
+ {
12
+ "name": "stdout",
13
+ "output_type": "stream",
14
+ "text": [
15
+ "Looking in indexes: http://mirrors.aliyun.com/pypi/simple\n",
16
+ "Requirement already satisfied: transformers in /root/miniconda3/lib/python3.12/site-packages (4.47.1)\n",
17
+ "Requirement already satisfied: sentencepiece in /root/miniconda3/lib/python3.12/site-packages (0.2.0)\n",
18
+ "Requirement already satisfied: google in /root/miniconda3/lib/python3.12/site-packages (3.0.0)\n",
19
+ "Requirement already satisfied: protobuf in /root/miniconda3/lib/python3.12/site-packages (5.27.0)\n",
20
+ "Requirement already satisfied: deepspeed in /root/miniconda3/lib/python3.12/site-packages (0.16.2)\n",
21
+ "Requirement already satisfied: peft in /root/miniconda3/lib/python3.12/site-packages (0.14.0)\n",
22
+ "Collecting datasets\n",
23
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d7/84/0df6c5981f5fc722381662ff8cfbdf8aad64bec875f75d80b55bfef394ce/datasets-3.2.0-py3-none-any.whl (480 kB)\n",
24
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
25
+ "\u001b[?25hRequirement already satisfied: filelock in /root/miniconda3/lib/python3.12/site-packages (from transformers) (3.14.0)\n",
26
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.27.0)\n",
27
+ "Requirement already satisfied: numpy>=1.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (1.26.4)\n",
28
+ "Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (23.2)\n",
29
+ "Requirement already satisfied: pyyaml>=5.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (6.0.1)\n",
30
+ "Requirement already satisfied: regex!=2019.12.17 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2024.11.6)\n",
31
+ "Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers) (2.31.0)\n",
32
+ "Requirement already satisfied: tokenizers<0.22,>=0.21 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.21.0)\n",
33
+ "Requirement already satisfied: safetensors>=0.4.1 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (0.4.5)\n",
34
+ "Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers) (4.66.2)\n",
35
+ "Requirement already satisfied: beautifulsoup4 in /root/miniconda3/lib/python3.12/site-packages (from google) (4.12.3)\n",
36
+ "Requirement already satisfied: einops in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (0.8.0)\n",
37
+ "Requirement already satisfied: hjson in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (3.1.0)\n",
38
+ "Requirement already satisfied: msgpack in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.1.0)\n",
39
+ "Requirement already satisfied: ninja in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (1.11.1.3)\n",
40
+ "Requirement already satisfied: psutil in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (5.9.8)\n",
41
+ "Requirement already satisfied: py-cpuinfo in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (9.0.0)\n",
42
+ "Requirement already satisfied: pydantic>=2.0.0 in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.10.4)\n",
43
+ "Requirement already satisfied: torch in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (2.3.0+cu121)\n",
44
+ "Requirement already satisfied: nvidia-ml-py in /root/miniconda3/lib/python3.12/site-packages (from deepspeed) (12.560.30)\n",
45
+ "Requirement already satisfied: accelerate>=0.21.0 in /root/miniconda3/lib/python3.12/site-packages (from peft) (1.2.1)\n",
46
+ "Collecting pyarrow>=15.0.0 (from datasets)\n",
47
+ " Downloading http://mirrors.aliyun.com/pypi/packages/3a/2e/3b99f8a3d9e0ccae0e961978a0d0089b25fb46ebbcfb5ebae3cca179a5b3/pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (40.1 MB)\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.1/40.1 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
49
+ "\u001b[?25hCollecting dill<0.3.9,>=0.3.0 (from datasets)\n",
50
+ " Downloading http://mirrors.aliyun.com/pypi/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl (116 kB)\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m53.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25hCollecting pandas (from datasets)\n",
53
+ " Downloading http://mirrors.aliyun.com/pypi/packages/38/f8/d8fddee9ed0d0c0f4a2132c1dfcf0e3e53265055da8df952a53e7eaf178c/pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)\n",
54
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
55
+ "\u001b[?25hCollecting requests (from transformers)\n",
56
+ " Downloading http://mirrors.aliyun.com/pypi/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)\n",
57
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
58
+ "\u001b[?25hCollecting tqdm>=4.27 (from transformers)\n",
59
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
60
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.5/78.5 kB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
61
+ "\u001b[?25hCollecting xxhash (from datasets)\n",
62
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/a7/81dba5010f7e733de88af9555725146fc133be97ce36533867f4c7e75066/xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
63
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.4/194.4 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
64
+ "\u001b[?25hCollecting multiprocess<0.70.17 (from datasets)\n",
65
+ " Downloading http://mirrors.aliyun.com/pypi/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl (146 kB)\n",
66
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.7/146.7 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
67
+ "\u001b[?25hRequirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /root/miniconda3/lib/python3.12/site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.5.0)\n",
68
+ "Collecting aiohttp (from datasets)\n",
69
+ " Downloading http://mirrors.aliyun.com/pypi/packages/40/7f/6de218084f9b653026bd7063cd8045123a7ba90c25176465f266976d8c82/aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)\n",
70
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
71
+ "\u001b[?25hCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n",
72
+ " Downloading http://mirrors.aliyun.com/pypi/packages/b9/74/fbb6559de3607b3300b9be3cc64e97548d55678e44623db17820dbd20002/aiohappyeyeballs-2.4.4-py3-none-any.whl (14 kB)\n",
73
+ "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n",
74
+ " Downloading http://mirrors.aliyun.com/pypi/packages/ec/6a/bc7e17a3e87a2985d3e8f4da4cd0f481060eb78fb08596c42be62c90a4d9/aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB)\n",
75
+ "Requirement already satisfied: attrs>=17.3.0 in /root/miniconda3/lib/python3.12/site-packages (from aiohttp->datasets) (23.2.0)\n",
76
+ "Collecting frozenlist>=1.1.1 (from aiohttp->datasets)\n",
77
+ " Downloading http://mirrors.aliyun.com/pypi/packages/af/f2/64b73a9bb86f5a89fb55450e97cd5c1f84a862d4ff90d9fd1a73ab0f64a5/frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)\n",
78
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m283.6/283.6 kB\u001b[0m \u001b[31m41.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
79
+ "\u001b[?25hCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n",
80
+ " Downloading http://mirrors.aliyun.com/pypi/packages/d3/c8/529101d7176fe7dfe1d99604e48d69c5dfdcadb4f06561f465c8ef12b4df/multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (131 kB)\n",
81
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m131.0/131.0 kB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
82
+ "\u001b[?25hCollecting propcache>=0.2.0 (from aiohttp->datasets)\n",
83
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1c/07/ebe102777a830bca91bbb93e3479cd34c2ca5d0361b83be9dbd93104865e/propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243 kB)\n",
84
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m243.6/243.6 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
85
+ "\u001b[?25hCollecting yarl<2.0,>=1.17.0 (from aiohttp->datasets)\n",
86
+ " Downloading http://mirrors.aliyun.com/pypi/packages/1a/e1/a097d5755d3ea8479a42856f51d97eeff7a3a7160593332d98f2709b3580/yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (336 kB)\n",
87
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m336.9/336.9 kB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
88
+ "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\n",
89
+ "Requirement already satisfied: annotated-types>=0.6.0 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)\n",
90
+ "Requirement already satisfied: pydantic-core==2.27.2 in /root/miniconda3/lib/python3.12/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)\n",
91
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.0.4)\n",
92
+ "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (3.7)\n",
93
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2.1.0)\n",
94
+ "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers) (2024.2.2)\n",
95
+ "Requirement already satisfied: sympy in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (1.12.1)\n",
96
+ "Requirement already satisfied: networkx in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.3)\n",
97
+ "Requirement already satisfied: jinja2 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (3.1.4)\n",
98
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
99
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
100
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
101
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (8.9.2.26)\n",
102
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.3.1)\n",
103
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.0.2.54)\n",
104
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (10.3.2.106)\n",
105
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (11.4.5.107)\n",
106
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.0.106)\n",
107
+ "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (2.20.5)\n",
108
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /root/miniconda3/lib/python3.12/site-packages (from torch->deepspeed) (12.1.105)\n",
109
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /root/miniconda3/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->deepspeed) (12.5.40)\n",
110
+ "Requirement already satisfied: soupsieve>1.2 in /root/miniconda3/lib/python3.12/site-packages (from beautifulsoup4->google) (2.5)\n",
111
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /root/miniconda3/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
112
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
113
+ " Downloading http://mirrors.aliyun.com/pypi/packages/11/c3/005fcca25ce078d2cc29fd559379817424e94885510568bc1bc53d7d5846/pytz-2024.2-py2.py3-none-any.whl (508 kB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m508.0/508.0 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hCollecting tzdata>=2022.7 (from pandas->datasets)\n",
116
+ " Downloading http://mirrors.aliyun.com/pypi/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl (346 kB)\n",
117
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m346.6/346.6 kB\u001b[0m \u001b[31m36.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
118
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /root/miniconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
119
+ "Requirement already satisfied: MarkupSafe>=2.0 in /root/miniconda3/lib/python3.12/site-packages (from jinja2->torch->deepspeed) (2.1.5)\n",
120
+ "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /root/miniconda3/lib/python3.12/site-packages (from sympy->torch->deepspeed) (1.3.0)\n",
121
+ "Installing collected packages: pytz, xxhash, tzdata, tqdm, requests, pyarrow, propcache, multidict, frozenlist, dill, aiohappyeyeballs, yarl, pandas, multiprocess, aiosignal, aiohttp, datasets\n",
122
+ " Attempting uninstall: tqdm\n",
123
+ " Found existing installation: tqdm 4.66.2\n",
124
+ " Uninstalling tqdm-4.66.2:\n",
125
+ " Successfully uninstalled tqdm-4.66.2\n",
126
+ " Attempting uninstall: requests\n",
127
+ " Found existing installation: requests 2.31.0\n",
128
+ " Uninstalling requests-2.31.0:\n",
129
+ " Successfully uninstalled requests-2.31.0\n",
130
+ "Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 datasets-3.2.0 dill-0.3.8 frozenlist-1.5.0 multidict-6.1.0 multiprocess-0.70.16 pandas-2.2.3 propcache-0.2.1 pyarrow-18.1.0 pytz-2024.2 requests-2.32.3 tqdm-4.67.1 tzdata-2024.2 xxhash-3.5.0 yarl-1.18.3\n",
131
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m"
133
+ ]
134
+ }
135
+ ],
136
+ "source": [
137
+ "!pip install transformers sentencepiece google protobuf deepspeed peft datasets "
138
+ ]
139
+ },
140
+ {
141
+ "cell_type": "code",
142
+ "execution_count": 9,
143
+ "id": "4e906370-40c7-4f6b-a700-f183a9308c78",
144
+ "metadata": {},
145
+ "outputs": [
146
+ {
147
+ "name": "stdout",
148
+ "output_type": "stream",
149
+ "text": [
150
+ "https://hf-mirror.com\n"
151
+ ]
152
+ }
153
+ ],
154
+ "source": [
155
+ "import os\n",
156
+ "\n",
157
+ "# Set the environment variable (AutoDL dedicated zones and other IDCs)\n",
158
+ "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
159
+ "\n",
160
+ "# Print the environment variable to confirm it was set\n",
161
+ "print(os.environ.get('HF_ENDPOINT'))"
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": 1,
167
+ "id": "ecc98529-6581-41d2-a876-23ce5187cae7",
168
+ "metadata": {},
169
+ "outputs": [],
170
+ "source": [
171
+ "import subprocess\n",
172
+ "import os\n",
173
+ "# Set the proxy environment variables (AutoDL general regions)\n",
174
+ "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n",
175
+ "output = result.stdout\n",
176
+ "for line in output.splitlines():\n",
177
+ " if '=' in line:\n",
178
+ " var, value = line.split('=', 1)\n",
179
+ " os.environ[var] = value"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 2,
185
+ "id": "b01fc372-33af-46e5-8c0e-8bccba7237ee",
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "from datasets import load_dataset\n",
190
+ "# Load the core promoter prediction dataset (~59k samples) and hold out 10% for testing\n",
191
+ "dataset = load_dataset(\"dnagpt/dna_core_promoter\")['train'].train_test_split(test_size=0.1)"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 3,
197
+ "id": "136c38d4-bd0f-4ecd-9165-2fd5b5207c1d",
198
+ "metadata": {},
199
+ "outputs": [
200
+ {
201
+ "data": {
202
+ "text/plain": [
203
+ "DatasetDict({\n",
204
+ " train: Dataset({\n",
205
+ " features: ['sequence', 'label'],\n",
206
+ " num_rows: 53276\n",
207
+ " })\n",
208
+ " test: Dataset({\n",
209
+ " features: ['sequence', 'label'],\n",
210
+ " num_rows: 5920\n",
211
+ " })\n",
212
+ "})"
213
+ ]
214
+ },
215
+ "execution_count": 3,
216
+ "metadata": {},
217
+ "output_type": "execute_result"
218
+ }
219
+ ],
220
+ "source": [
221
+ "dataset"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": []
244
+ }
245
+ ],
246
+ "metadata": {
247
+ "kernelspec": {
248
+ "display_name": "Python 3 (ipykernel)",
249
+ "language": "python",
250
+ "name": "python3"
251
+ },
252
+ "language_info": {
253
+ "codemirror_mode": {
254
+ "name": "ipython",
255
+ "version": 3
256
+ },
257
+ "file_extension": ".py",
258
+ "mimetype": "text/x-python",
259
+ "name": "python",
260
+ "nbconvert_exporter": "python",
261
+ "pygments_lexer": "ipython3",
262
+ "version": "3.12.3"
263
+ }
264
+ },
265
+ "nbformat": 4,
266
+ "nbformat_minor": 5
267
+ }
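The notebook above does two small things that can be checked without network access: it copies `KEY=VALUE` proxy lines into `os.environ`, and it holds out 10% of the promoter dataset. What follows is a minimal Python sketch of both, not the notebook's actual code path. `parse_env_lines` and `split_sizes` are hypothetical helper names, the sample proxy URLs are made up, and the rounding rule (test size rounded up, remainder assigned to train) is an assumption that happens to reproduce the 53276/5920 row counts shown in the cell output.

```python
import math


def parse_env_lines(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (as printed by `env`) into a dict."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip anything that is not an assignment
        # Split on the first '=' only, so values containing '=' survive.
        key, value = line.split("=", 1)
        if key:
            parsed[key] = value
    return parsed


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# Hypothetical sample; the real values come from /etc/network_turbo.
sample = (
    "http_proxy=http://127.0.0.1:7890\n"
    "https_proxy=http://127.0.0.1:7890?a=b\n"
    "not a variable\n"
)
proxies = parse_env_lines(sample)

# 53276 + 5920 rows, as reported by the `dataset` cell output.
total_rows = 53276 + 5920
print(total_rows, split_sizes(total_rows, 0.1))  # 59196 (53276, 5920)
```

In the notebook, `parse_env_lines(result.stdout)` could feed `os.environ.update(...)`. Splitting on the first `=` only keeps values that themselves contain `=` intact.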
env_ini.ipynb CHANGED
@@ -221,15 +221,26 @@
221
  "dataset"
222
  ]
223
  },
224
  {
225
  "cell_type": "code",
226
  "execution_count": null,
227
- "id": "3724fdb2-ca96-4439-8263-8ff76caa7e92",
228
  "metadata": {},
229
  "outputs": [],
230
- "source": [
231
- "# Read the first 10 records\n"
232
- ]
233
  }
234
  ],
235
  "metadata": {
 
221
  "dataset"
222
  ]
223
  },
224
+ {
225
+ "cell_type": "markdown",
226
+ "id": "28acb64e-8d1e-4482-a515-344a2e0344c4",
227
+ "metadata": {},
228
+ "source": [
229
+ "## Git LFS support\n",
230
+ "apt-get update\n",
231
+ "\n",
232
+ "apt-get install git-lfs\n",
233
+ "\n",
234
+ "git lfs install"
235
+ ]
236
+ },
237
  {
238
  "cell_type": "code",
239
  "execution_count": null,
240
+ "id": "3d3cefb0-1eed-4f23-8591-1990f7113820",
241
  "metadata": {},
242
  "outputs": [],
243
+ "source": []
 
 
244
  }
245
  ],
246
  "metadata": {
lecture_intro.ipynb ADDED
@@ -0,0 +1,99 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "b0a2c873-7447-4402-85b5-facbb0d7c0a3",
6
+ "metadata": {},
7
+ "source": [
8
+ "# DNAGPT2- The Best Beginner's Guide to Gene Sequence Large Language Models\n",
9
+ "\n",
10
+ "### 1. Overview\n",
11
+ "Large language models have long since moved beyond NLP research and are becoming a cornerstone of AI for science. Among bioinformatics data, gene sequences most closely resemble natural language, which has made applying large models to biological sequences a hot research direction in recent years. The 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold's protein structure prediction, has further pointed the way forward for biological research.\n",
12
+ "\n",
13
+ "However, for most biologists, large models remain unfamiliar territory. Until 2023, models like GPT were niche topics within NLP research, only gaining public attention due to the emergence of ChatGPT.\n",
14
+ "\n",
15
+ "Most biology + large model research has emerged post-2023, but the significant interdisciplinary gap means these studies are typically collaborative efforts by large companies and teams. Replicating or learning from this work is challenging for many researchers, as evidenced by the issues sections of top papers on GitHub.\n",
16
+ "\n",
17
+ "On one hand, large models are almost certain to shape the future of biological research; on the other, many researchers hesitate at the threshold of large models. Providing a bridge over this gap is thus an urgent need.\n",
18
+ "\n",
19
+ "DNAGPT2 serves as this bridge, aiming to help more biologists overcome the large model barrier and use these powerful tools to advance their work.\n",
20
+ "\n",
21
+ "### 2. Tutorial Characteristics\n",
22
+ "This tutorial is characterized by:\n",
23
+ "\n",
24
+ "1. **Simplicity**: Simple code entirely built using Hugging Face’s standard libraries.\n",
25
+ "2. **Simplicity**: Basic theoretical explanations with full visual aids.\n",
26
+ "3. **Simplicity**: Classic paper cases that are easy to understand.\n",
27
+ "\n",
28
+ "Despite its simplicity, the tutorial is comprehensive: it covers building tokenizers, constructing GPT and BERT models from scratch, fine-tuning LLaMA models, basic DeepSpeed multi-GPU distributed training, and applying SOTA models such as LucaOne and ESM3. These topics are introduced progressively through typical biological tasks such as sequence classification, structure prediction, and regression analysis.\n",
29
+ "\n",
30
+ "### Target Audience:\n",
31
+ "1. Researchers and students in the field of biology, especially bioinformatics.\n",
32
+ "2. Beginners in large model learning, applicable beyond just biology.\n",
33
+ "\n",
34
+ "### 3. Tutorial Outline\n",
35
+ "#### 1 Data and Environment\n",
36
+ "1.1 Introduction to Large Model Runtime Environments \n",
37
+ "1.2 Gene-Related Pre-training and Fine-tuning Data \n",
38
+ "1.3 Basic Usage of Datasets Library \n",
39
+ "\n",
40
+ "#### 2 Building DNA GPT2/BERT Large Models from Scratch\n",
41
+ "2.1 Building DNA Tokenizer \n",
42
+ "2.2 Training DNA GPT2 Model from Scratch \n",
43
+ "2.3 Training DNA BERT Model from Scratch \n",
44
+ "2.4 Feature Extraction for Biological Sequences Using Gene Large Models \n",
45
+ "2.5 Building Large Models Based on Multimodal Data \n",
46
+ "\n",
47
+ "#### 3 Biological Sequence Tasks Using Gene Large Models\n",
48
+ "3.1 Sequence Classification Task \n",
49
+ "3.2 Structure Prediction Task \n",
50
+ "3.3 Multi-sequence Interaction Analysis \n",
51
+ "3.4 Function Prediction Task \n",
52
+ "3.5 Regression Tasks \n",
53
+ "\n",
54
+ "#### 4 Entering the ChatGPT Era: Gene Instruction Building and Fine-tuning\n",
55
+ "4.1 Expanding LLaMA Vocabulary Based on Gene Data \n",
56
+ "4.2 Introduction to DeepSpeed Distributed Training \n",
57
+ "4.3 Continuous Pre-training of LLaMA Model Based on Gene Data \n",
58
+ "4.4 Classification Task Using LLaMA-gene Large Model \n",
59
+ "4.5 Instruction Fine-tuning Based on LLaMA-gene Large Model \n",
60
+ "\n",
61
+ "#### 5 Overview of SOTA Large Model Applications in Biology\n",
62
+ "5.1 Application of DNABERT2 \n",
63
+ "5.2 Usage of LucaOne \n",
64
+ "5.3 Usage of ESM3 \n",
65
+ "5.4 Application of MedGPT \n",
66
+ "5.5 Application of LLaMA-gene"
67
+ ]
68
+ },
69
+ {
70
+ "cell_type": "code",
71
+ "execution_count": null,
72
+ "id": "1453bac8-82dc-4f1c-869d-399c99611c52",
73
+ "metadata": {},
74
+ "outputs": [],
75
+ "source": []
76
+ }
77
+ ],
78
+ "metadata": {
79
+ "kernelspec": {
80
+ "display_name": "Python 3 (ipykernel)",
81
+ "language": "python",
82
+ "name": "python3"
83
+ },
84
+ "language_info": {
85
+ "codemirror_mode": {
86
+ "name": "ipython",
87
+ "version": 3
88
+ },
89
+ "file_extension": ".py",
90
+ "mimetype": "text/x-python",
91
+ "name": "python",
92
+ "nbconvert_exporter": "python",
93
+ "pygments_lexer": "ipython3",
94
+ "version": "3.12.3"
95
+ }
96
+ },
97
+ "nbformat": 4,
98
+ "nbformat_minor": 5
99
+ }
lecture_intro_cn.ipynb ADDED
@@ -0,0 +1,137 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "2365faf7-39fb-4e53-a810-2e28c4f6b4c1",
6
+ "metadata": {},
7
+ "source": [
8
+ "# DNAGPT2 - The Best Beginner's Guide to Gene Sequence Large Language Models\n",
9
+ "\n",
10
+ "## 1 Overview\n",
11
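+ "Large language models for natural language have long since moved beyond NLP research and are becoming a cornerstone of AI for science. Among bioinformatics data, gene sequences most closely resemble natural language, so applying large models to biological sequences has become a hot research direction over the past year or two. In particular, the 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold's protein structure prediction, has pointed the way forward for biological research.\n",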
12
+ "\n",
13
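+ "For most researchers working in biology, however, large models remain very unfamiliar. In fact, before 2023, large models such as GPT were still a niche topic within NLP research; they only entered public view with the explosive rise of ChatGPT.\n",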
14
+ "\n",
15
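+ "Most biology + large model research has likewise appeared after 2023. Because the disciplinary gap is so wide, these papers are usually the product of collaboration within large companies and large teams, and most researchers find it very hard to learn from or reproduce the work, as can be seen in the GitHub issues of many top papers.\n",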
16
+ "\n",
17
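+ "On the one hand, large models are almost certainly the future of biological research; on the other, many biology researchers linger at the threshold of large models. Placing a ladder at that threshold has therefore become an urgent need in the field.\n",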
18
+ "\n",
19
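+ "DNAGPT2 is meant to be that ladder. We hope it serves as a modest starting point that helps more biologists cross the large model threshold, put these tools to work, and get ahead of their peers.\n",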
20
+ "\n",
21
+ "## 2 Tutorial Characteristics\n",
22
+ "This tutorial has the following main characteristics:\n",
23
+ "\n",
24
+ "1 Simple. The code is simple: all of it is built with Hugging Face standard libraries and can be picked up on first reading.\n",
25
+ "\n",
26
+ "2 Simple. The theory is simple: only the most basic network architectures are covered, all explained visually.\n",
27
+ "\n",
28
+ "3 Simple. The examples are simple: all are representative cases from classic papers and are easy to follow.\n",
29
+ "\n",
30
+ "\n",
31
+ "\n",
32
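+ "The content itself, however, is not simplistic: it ranges from building a basic tokenizer, to constructing typical models such as GPT and BERT from scratch, to fine-tuning LLaMA models and basic DeepSpeed multi-GPU distributed training, to applying SOTA large models such as LucaOne and ESM3. These topics are introduced step by step through typical biological tasks such as sequence classification, structure prediction, and regression analysis. The tutorial will keep pace with research trends and be updated continuously.\n",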
33
+ "\n",
34
+ "\n",
35
+ "\n",
36
+ "Target audience:\n",
37
+ "\n",
38
+ "1 Researchers, students, and others in biology, especially bioinformatics.\n",
39
+ "\n",
40
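+ "2 Beginners learning large models. Readers outside biology are welcome too; the material is much like any general introduction to large models, only the data differ.\n",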
41
+ "\n",
42
+ "## 3 Tutorial Outline\n",
43
+ "1 Data and Environment\n",
44
+ "\n",
45
+ "1.1 Introduction to Large Model Runtime Environments\n",
46
+ "\n",
47
+ "1.2 Gene-Related Pre-training and Fine-tuning Data\n",
48
+ "\n",
49
+ "1.3 Basic Usage of the datasets Library\n",
50
+ "\n",
51
+ "2 Building DNA GPT2/BERT Large Models from Scratch\n",
52
+ "\n",
53
+ "\n",
54
+ "2.1 Building a DNA Tokenizer\n",
55
+ "\n",
56
+ "2.2 Training a DNA GPT2 Model from Scratch\n",
57
+ "\n",
58
+ "2.3 Training a DNA BERT Model from Scratch\n",
59
+ "\n",
60
+ "2.4 Feature Extraction for Biological Sequences Using Gene Large Models\n",
61
+ "\n",
62
+ "2.5 Building Large Models Based on Multimodal Data\n",
63
+ "\n",
64
+ "\n",
65
+ "\n",
66
+ "3 Biological Sequence Tasks Using Gene Large Models\n",
67
+ "\n",
68
+ "3.1 Sequence Classification Task\n",
69
+ "\n",
70
+ "3.2 Sequence Structure Prediction\n",
71
+ "\n",
72
+ "3.3 Multi-sequence Interaction Analysis\n",
73
+ "\n",
74
+ "3.4 Function Prediction Task\n",
75
+ "\n",
76
+ "3.5 Regression Tasks\n",
77
+ "\n",
78
+ "\n",
79
+ "\n",
80
+ "4 Entering the ChatGPT Era: Gene Instruction Building and Fine-tuning\n",
81
+ "\n",
82
+ "4.1 Expanding the LLaMA Vocabulary Based on Gene Data\n",
83
+ "\n",
84
+ "4.2 Introduction to DeepSpeed Distributed Training\n",
85
+ "\n",
86
+ "4.3 Continued Pre-training of the LLaMA Model on Gene Data\n",
87
+ "\n",
88
+ "4.4 Classification Tasks with the LLaMA-gene Large Model\n",
89
+ "\n",
90
+ "4.5 Instruction Fine-tuning with the LLaMA-gene Large Model\n",
91
+ "\n",
92
+ "\n",
93
+ "\n",
94
+ "5 Overview of SOTA Large Model Applications in Biology\n",
95
+ "\n",
96
+ "5.1 Application of DNABERT2\n",
97
+ "\n",
98
+ "5.2 Usage of LucaOne\n",
99
+ "\n",
100
+ "5.3 Usage of ESM3\n",
101
+ "\n",
102
+ "5.4 Application of MedGPT\n",
103
+ "\n",
104
+ "5.5 Application of LLaMA-gene"
105
+ ]
106
+ },
107
+ {
108
+ "cell_type": "code",
109
+ "execution_count": null,
110
+ "id": "3252ef0f-3193-43f3-9dcf-5d2b625dbdf7",
111
+ "metadata": {},
112
+ "outputs": [],
113
+ "source": []
114
+ }
115
+ ],
116
+ "metadata": {
117
+ "kernelspec": {
118
+ "display_name": "Python 3 (ipykernel)",
119
+ "language": "python",
120
+ "name": "python3"
121
+ },
122
+ "language_info": {
123
+ "codemirror_mode": {
124
+ "name": "ipython",
125
+ "version": 3
126
+ },
127
+ "file_extension": ".py",
128
+ "mimetype": "text/x-python",
129
+ "name": "python",
130
+ "nbconvert_exporter": "python",
131
+ "pygments_lexer": "ipython3",
132
+ "version": "3.12.3"
133
+ }
134
+ },
135
+ "nbformat": 4,
136
+ "nbformat_minor": 5
137
+ }