File size: 12,256 Bytes
83f9751
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ca90249
 
 
83f9751
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "50ff8836-7075-4858-b463-c99f973f408d",
   "metadata": {},
   "source": [
    "# 2 基因相关预训练和微调数据"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17cde5bb-70e5-437e-a4a3-193a881dd412",
   "metadata": {},
   "source": [
    "本教程主要关注基因相关的生物序列数据,包括主要的DNA和蛋白质序列,data目录下数据如下:\n",
    "\n",
    "* dna_1g.txt DNA序列数据,大小1G,从GUE数据集中抽取,具体可参考dnabert2的论文,包括多个模式生物的数据(https://github.com/MAGICS-LAB/DNABERT_2)\n",
    "* potein_1g.txt 蛋白质序列数据,大小1G,从pdb/uniprot数据库中抽取(https://www.uniprot.org/help/downloads)\n",
    "* english_500m.txt  英文数据,大小500M,就是英文百科(https://huggingface.co/datasets/Salesforce/wikitext, https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b45ecf27-1514-45e0-bfbd-361e6dcc98ea",
   "metadata": {},
   "source": [
    "下面演示下huggingface的dataset库的基本用法,以及样例数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2715f9bb-2e43-4bd6-8715-5c96d317bcf8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c067aeb8ab304723ac6b527e7ad6c768",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating train split: 0 examples [00:00, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['text'],\n",
       "        num_rows: 1079595\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#读取dna数据\n",
    "from datasets import load_dataset\n",
    "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')\n",
    "dna_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec00ad72-c5f9-40db-8508-6c6bf8f374c1",
   "metadata": {},
   "source": [
    "\n",
    "Datasets 提供了加载脚本来加载本地和远程数据集。它支持几种常见的数据格式,例如:\n",
    "\n",
    "| Data format       | Loading script | Example                                                                 |\n",
    "|-------------------|----------------|-------------------------------------------------------------------------|\n",
    "| CSV & TSV         | csv            | `load_dataset(\"csv\", data_files=\"my_file.csv\")`                         |\n",
    "| Text files        | text           | `load_dataset(\"text\", data_files=\"my_file.txt\")`                        |\n",
    "| JSON & JSON Lines | json           | `load_dataset(\"json\", data_files=\"my_file.jsonl\")`                      |\n",
    "| Pickled DataFrames| pandas         | `load_dataset(\"pandas\", data_files=\"my_dataframe.pkl\")`                |\n",
    "\n",
    "如表所示, 对于每种数据格式, 我们只需要使用 load_dataset() 函数, 使用 data_files 指定一个或多个文件的路径的参数。 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24c40ec7-cb59-4c3a-8052-00d7979f6208",
   "metadata": {},
   "source": [
    "load_dataset默认加载到train下,可以把dataset当做一个一般的python dict使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2a375409-d2b6-4648-8f6a-8ac3fb25bb75",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'text': 'TTAAATCCTAGAAGTTGGTTACACGGGTGAGGAAAATGGTGAGAAGCCCAATGGGATGCTGTAGCAATGACAGTGAACTGCTGTCACCCCTGAGGCTGGAAAGATAACAGACATTTGCCAGGAGCTAGAAGCTGGGGCAGCCTGGTAGGAGCGAGAATATGGTGAGAGCTGCCCCCTGGGGATGGAACCACAGAGGGAGGGTCTCTCTGATGAGACATAGAGCCAAGAACAGATACAGCCATTGTGGGAGATGGTAACCAAAGCAGAGAGAGAGAGAGAGAGCGAGAGAGAGAGAAAACACCCTGGTTTCTTCCTTCCTTCCACCTTTGAGTTTCCCACCAGTGCTTCCCATTAGCCCAAACTACCAAGAACCCAGAGGGCAAAGGAGCCCGGGAAATCTAATTCTACATGATACCGAGCAAAGCCGATGTTCCAGCTGGCTGCGTCTGTTACAGTAGGTAGTCAGGCAGACATAAGCAGGGCAGGAGAGGGCTCCTCCCAACCAGGAATGTCAGGTGACGGTCAGGTGATGGTCAGGTGGTCATTAACTGTCTCTCTAAAATAATAATTGGTTACAGCCAGCACCAGGGAAAGGCAGTCTCCCAACCGATAGAAACATCTGAAACTGATGATCAGTAGCTTCCCAATAAGGTCTCAGGAGTTGGACGCATGGGCTCAGCATGAACACTGAGAGGCAAAATGGTGGAGTTTAACTGGTATATGACCTTCCTCTAGAAACATTCAGCTGGTAAGGGAAGAACGCCTTAAGCGAATATGCACGCAACTCCAGTAAACACTGTGCATGTTCCTGTCCCAATGCTGGTAGACCACTGCGCATGCAAACAGCCCACCCCAGGGAAGAATCAGGAGAGAAGAGACCCCACAAGCATGCCAACACATAAAACCCCAAGTCAGGAGTCAAACCATGCACTTGAATCAAGTCACCCACTTAGCTCTCTTTCAAGTGTATTTTACTTTCTTTCATTCCTGCTCTAAAACT'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_dataset[\"train\"][0]"
   ]
  },
  {
   "cell_type": "raw",
   "id": "985bd82a-1ff0-49ef-968d-8d5f6df8d76f",
   "metadata": {},
   "source": [
    "dna数据就是如上所示,由ATCG 4个字母组成的文本,对于学习大语言模型而言,可以不关注其具体的含义,当然,大部分dna序列的含义目前也都没有解读:)\n",
    "\n",
    "然后是蛋白质序列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "94e3f443-939e-4148-bba6-6cafa90790b6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a1023bd5311a4a5dbe96c6c8fdc5b519",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating train split: 0 examples [00:00, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "{'text': 'MLTDPFGRTIKLRIAVTRCLCIYCHREGESDPGTEMSAERIAEIAKAFYELGIKKLKLTGGEPLLRKDICEIISMMPDFEEISLTTGILLSDLAFDLKESGLDRVISLDTLDAETFRFITGGGELSRVLEGLRMAVEAKLTPIKLMVLMSGLESEVRKMLEFASFEETVILQLIELIPSRTGKFYLDPTIFEKDFERVAKAVKIRDMHRRKQFITPFGVVEIVKPLDTEFCMHCRIRITSDGRIKLCLMSDETVDISELSGDELKKAIFEAVKRRKPFFIMKGEILALISAVLWGFAPILDRYALLSGAPIYAALAIRAFGALIAMLFILSVLRGGLAVEAKAAVLLLIAGAIGGALAMVFYYLALESVGASRTVPITAIYPMFTALFSFLLLSEPLSPKTIAGIAFIVLGVILVSEGMVKLRGEDVVIRKYDHSMDRDKLIEMYVYDPRFRCLGLPPLSKEAIKGWIDYLGQGFAIIAEKDGKIVGHLVIVPGEREVDLTIFIHQDYQLGLGQEMMKLIIDFCRKAGFAITLVTERTARAIHVYRKLGFEIVAPYYEYDMRLQLKMIVPKGKTVLIKGTASIRGECEVLGARLFFESEKFVPVFCLEDCEIEVGEFKILDGSTIPESWEKLSKMDWETVFLYGGVDSGKSTLATYLAKVGGAYVLDLDIGQADVAPGAMGYGFAKDVVSLSKVSMIGFFVGSITPQGREAKCLRGVARLWKELRKLDGRKIIDTTGWVRGRRAKEYKLAKLEIIEPDLIASFEGKLFDWKTFEVEKGYVIRRDKDRAKARFESYRKFLDGAKTFELERDGIKLKPDFFKGKDVSQFIESVLGTRVVFARLGEEHLTICTKEDCPEYEILRELKELYEVDDIFLFSESEARFVAGLYRGKKYLGIGLIKSIDRILLECTQSDFDTIEIGEIRLEDGRECFIKRFMARIAYSYKPQDETRAARAMGYEVPISFKHAMEICRVLKGKKVPQAISFLEEVVQLKVPVPFRKHKKKVAHKIPGWYAGRYPQKAAEILKVLKLKAAEYKGLKAEELIIVHAQAKK'}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "protein_dataset = load_dataset('text', data_files='data/protein_1g.txt')\n",
    "protein_dataset[\"train\"][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecaa8216-7b9f-4ba0-af8e-c7c868dc7ec9",
   "metadata": {},
   "source": [
    "蛋白质序列,则是有MLTDP等20个字母/氨基酸 组成的文本,当然,目前对蛋白质的理解远超过对DNA的。\n",
    "\n",
    "然后就是英文文本了,这个就比较容易看懂"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "7521f8ea-fd70-4f5b-aeeb-7ff81635320d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'text': ' \" There \\'s Got to Be a Way \" is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as the fifth and final single from the album in the United Kingdom . It was one of four songs Carey wrote with Ric Wake during their first recording session together , but \" There \\'s Got to Be a Way \" was the only composition to make the final track listing . It is a socio @-@ political conscious R & B @-@ pop song which addresses the existence of poverty , racism and war in the world which gradually becomes more aspirational and positive as it progresses . The track garnered a mixed reception upon the album \\'s release in 1990 . While Carey \\'s vocals were praised , it was seen as too political . An accompanying music video highlights social injustices . The song reached number 54 on the UK Singles Chart . '}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "english_dataset = load_dataset('text', data_files='data/english_500m.txt')\n",
    "english_dataset[\"train\"][301]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fcad08d-6389-453e-997f-eb2877a5fbbb",
   "metadata": {},
   "source": [
    "英文序列,就是26个字母组成的文本了,当然,英文是包括空格的,生物序列则没有明确的空格"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e4e1e85-a187-469d-9950-1c6cbb9c41f7",
   "metadata": {},
   "source": [
    "前面这些数据集,就是常规的文本,一般就是当做预训练数据使用,而分类等下游微调任务,一般都是包含标签的,多写成json或者csv的格式,这里也给出一个例子:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c48dd04e-af42-4222-94d5-56a8e08e2cbf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7c611d1ab3bb408394196e7929d8e0c5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating train split: 0 examples [00:00, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "{'sentence1': 'ATGGAGGAAAATCAGACCATGGTCACAGAGTTCGTCCTGCTGGGATTCTGTCTTGGCCCGAGGATTCACCTAGTTCTTTTTCTGCTTTTCTCTCTCTTCTATACTCTCACCATACTGGGGAATGGGACTATCCTTGCAATGATCTGCCTGGACTCCAGACTCCACACTCCCATGTACTTCTTCCTGTCCCACCTGGCCATTGTCGATATGGCCTATGCCTGCAACACAGTGCCTCAGACACTCATAAACCTCTTGGATGAGACCAGGCCCATCACCTTTGCTGGATGCATGACACAGACCTTTCTCTTCTTGGCTTTTGCCCACACTGAATGTGTGCTCCTTGTTGTGATGTCCTATGACCGGTATGTAGCTATCTGCCACCCGCTACACTACACTGTCATCATGAACTGGAGAGTGTGTACCATTCTGGCTGCTGTTTCCTGGATATTTAGCTTTCTCCTTGCTCTGGTCCATTTAGTTCTCATCCTGAGGCTGCCCTTCTGTGGACCTCATGAAATCAATCACTTCTTCTGTGAAATCCTGTCTGTCCTCAAGCTGGCCTGTGCTGACACAACACTCAATCAGGTCGTTATCTTTGCAGCTTGTGTGTTCATATTAGTGGCCCCCCTATGCTTTGTACTAGTCTCCTACACACGCATCCTGGTGGCCATCCTGAGGATCCAGTCAGGGGAGGGACGCAGAAAGGCCTTCTCTACCTGTTCCTCCCACCTCTGTGTGGTAGGGCTCTTCTTTGGCAGTGCCATTGTCATGTACATGGCCCCCAAGTCCCAGCACCCAGAGGAGCAGCAGAAGGTTCTTTTCCTGTTTTACAGTTTTTTCAACCCCATGCTGAACCCCCTAATCTACAGTCTGAGGAATGCTGAGGTGAAGGGCGCCCTCAAGAGGTCACTGTGCAAAGAAAGTCATTCCTGGTTGGTGTGGTGTTCGGACCATAAATCTTGG',\n",
       " 'sentence2': 'MEENQTMVTEFVLLGFCLGPRIHLVLFLLFSLFYTLTILGNGTILAMICLDSRLHTPMYFFLSHLAIVDMAYACNTVPQTLINLLDETRPITFAGCMTQTFLFLAFAHTECVLLVVMSYDRYVAICHPLHYTVIMNWRVCTILAAVSWIFSFLLALVHLVLILRLPFCGPHEINHFFCEILSVLKLACADTTLNQVVIFAACVFILVAPLCFVLVSYTRILVAILRIQSGEGRRKAFSTCSSHLCVVGLFFGSAIVMYMAPKSQHPEEQQKVLFLFYSFFNPMLNPLIYSLRNAEVKGALKRSLCKESHSWLVWCSDHKSW',\n",
       " 'label': 1}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ft_dataset = load_dataset('json', data_files='data/dna_protein_my.json')\n",
    "ft_dataset[\"train\"][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f3ec639-e426-4233-a20a-dad94069175b",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}