{
"cells": [
{
"cell_type": "markdown",
"id": "50ff8836-7075-4858-b463-c99f973f408d",
"metadata": {},
"source": [
"# 2 Gene-Related Pretraining and Fine-Tuning Data"
]
},
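{
"cell_type": "markdown",
"id": "17cde5bb-70e5-437e-a4a3-193a881dd412",
"metadata": {},
"source": [
"This tutorial focuses on gene-related biological sequence data, mainly DNA and protein sequences. The `data` directory contains the following files:\n",
"\n",
"* `dna_1g.txt`: DNA sequence data, about 1 GB, sampled from the GUE dataset and covering multiple model organisms; see the DNABERT-2 paper for details (https://github.com/MAGICS-LAB/DNABERT_2)\n",
"* `protein_1g.txt`: protein sequence data, about 1 GB, sampled from the PDB/UniProt databases (https://www.uniprot.org/help/downloads)\n",
"* `english_500m.txt`: English text, about 500 MB, taken from English encyclopedia articles (WikiText) (https://huggingface.co/datasets/Salesforce/wikitext, https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1)"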
]
},
{
"cell_type": "markdown",
"id": "b45ecf27-1514-45e0-bfbd-361e6dcc98ea",
"metadata": {},
"source": [
"The cells below demonstrate the basic usage of the Hugging Face `datasets` library and show a few sample records."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2715f9bb-2e43-4bd6-8715-5c96d317bcf8",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c067aeb8ab304723ac6b527e7ad6c768",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"DatasetDict({\n",
" train: Dataset({\n",
" features: ['text'],\n",
" num_rows: 1079595\n",
" })\n",
"})"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the DNA data\n",
"from datasets import load_dataset\n",
"dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')\n",
"dna_dataset"
]
},
{
"cell_type": "markdown",
"id": "ec00ad72-c5f9-40db-8508-6c6bf8f374c1",
"metadata": {},
"source": [
"\n",
"Datasets provides loading scripts for loading local and remote datasets. It supports several common data formats, for example:\n",
"\n",
"| Data format | Loading script | Example |\n",
"|-------------------|----------------|-------------------------------------------------------------------------|\n",
"| CSV & TSV | csv | `load_dataset(\"csv\", data_files=\"my_file.csv\")` |\n",
"| Text files | text | `load_dataset(\"text\", data_files=\"my_file.txt\")` |\n",
"| JSON & JSON Lines | json | `load_dataset(\"json\", data_files=\"my_file.jsonl\")` |\n",
"| Pickled DataFrames| pandas | `load_dataset(\"pandas\", data_files=\"my_dataframe.pkl\")` |\n",
"\n",
"As the table shows, for each data format you only need to call `load_dataset()` and pass the path to one or more files through the `data_files` argument."
]
},
{
"cell_type": "markdown",
"id": "24c40ec7-cb59-4c3a-8052-00d7979f6208",
"metadata": {},
"source": [
"By default, `load_dataset` places the data in the `train` split. The returned `DatasetDict` can be used like an ordinary Python dict."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2a375409-d2b6-4648-8f6a-8ac3fb25bb75",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'text': 'TTAAATCCTAGAAGTTGGTTACACGGGTGAGGAAAATGGTGAGAAGCCCAATGGGATGCTGTAGCAATGACAGTGAACTGCTGTCACCCCTGAGGCTGGAAAGATAACAGACATTTGCCAGGAGCTAGAAGCTGGGGCAGCCTGGTAGGAGCGAGAATATGGTGAGAGCTGCCCCCTGGGGATGGAACCACAGAGGGAGGGTCTCTCTGATGAGACATAGAGCCAAGAACAGATACAGCCATTGTGGGAGATGGTAACCAAAGCAGAGAGAGAGAGAGAGAGCGAGAGAGAGAGAAAACACCCTGGTTTCTTCCTTCCTTCCACCTTTGAGTTTCCCACCAGTGCTTCCCATTAGCCCAAACTACCAAGAACCCAGAGGGCAAAGGAGCCCGGGAAATCTAATTCTACATGATACCGAGCAAAGCCGATGTTCCAGCTGGCTGCGTCTGTTACAGTAGGTAGTCAGGCAGACATAAGCAGGGCAGGAGAGGGCTCCTCCCAACCAGGAATGTCAGGTGACGGTCAGGTGATGGTCAGGTGGTCATTAACTGTCTCTCTAAAATAATAATTGGTTACAGCCAGCACCAGGGAAAGGCAGTCTCCCAACCGATAGAAACATCTGAAACTGATGATCAGTAGCTTCCCAATAAGGTCTCAGGAGTTGGACGCATGGGCTCAGCATGAACACTGAGAGGCAAAATGGTGGAGTTTAACTGGTATATGACCTTCCTCTAGAAACATTCAGCTGGTAAGGGAAGAACGCCTTAAGCGAATATGCACGCAACTCCAGTAAACACTGTGCATGTTCCTGTCCCAATGCTGGTAGACCACTGCGCATGCAAACAGCCCACCCCAGGGAAGAATCAGGAGAGAAGAGACCCCACAAGCATGCCAACACATAAAACCCCAAGTCAGGAGTCAAACCATGCACTTGAATCAAGTCACCCACTTAGCTCTCTTTCAAGTGTATTTTACTTTCTTTCATTCCTGCTCTAAAACT'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dna_dataset[\"train\"][0]"
]
},
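{
"cell_type": "raw",
"id": "985bd82a-1ff0-49ef-968d-8d5f6df8d76f",
"metadata": {},
"source": [
"As shown above, DNA data is text made up of the four letters A, T, C, and G. For the purpose of learning about large language models, you can ignore what the sequences mean; in fact, the meaning of most DNA sequences has not yet been deciphered :)\n",
"\n",
"Next is the protein sequence data."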
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "94e3f443-939e-4148-bba6-6cafa90790b6",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a1023bd5311a4a5dbe96c6c8fdc5b519",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"{'text': 'MLTDPFGRTIKLRIAVTRCLCIYCHREGESDPGTEMSAERIAEIAKAFYELGIKKLKLTGGEPLLRKDICEIISMMPDFEEISLTTGILLSDLAFDLKESGLDRVISLDTLDAETFRFITGGGELSRVLEGLRMAVEAKLTPIKLMVLMSGLESEVRKMLEFASFEETVILQLIELIPSRTGKFYLDPTIFEKDFERVAKAVKIRDMHRRKQFITPFGVVEIVKPLDTEFCMHCRIRITSDGRIKLCLMSDETVDISELSGDELKKAIFEAVKRRKPFFIMKGEILALISAVLWGFAPILDRYALLSGAPIYAALAIRAFGALIAMLFILSVLRGGLAVEAKAAVLLLIAGAIGGALAMVFYYLALESVGASRTVPITAIYPMFTALFSFLLLSEPLSPKTIAGIAFIVLGVILVSEGMVKLRGEDVVIRKYDHSMDRDKLIEMYVYDPRFRCLGLPPLSKEAIKGWIDYLGQGFAIIAEKDGKIVGHLVIVPGEREVDLTIFIHQDYQLGLGQEMMKLIIDFCRKAGFAITLVTERTARAIHVYRKLGFEIVAPYYEYDMRLQLKMIVPKGKTVLIKGTASIRGECEVLGARLFFESEKFVPVFCLEDCEIEVGEFKILDGSTIPESWEKLSKMDWETVFLYGGVDSGKSTLATYLAKVGGAYVLDLDIGQADVAPGAMGYGFAKDVVSLSKVSMIGFFVGSITPQGREAKCLRGVARLWKELRKLDGRKIIDTTGWVRGRRAKEYKLAKLEIIEPDLIASFEGKLFDWKTFEVEKGYVIRRDKDRAKARFESYRKFLDGAKTFELERDGIKLKPDFFKGKDVSQFIESVLGTRVVFARLGEEHLTICTKEDCPEYEILRELKELYEVDDIFLFSESEARFVAGLYRGKKYLGIGLIKSIDRILLECTQSDFDTIEIGEIRLEDGRECFIKRFMARIAYSYKPQDETRAARAMGYEVPISFKHAMEICRVLKGKKVPQAISFLEEVVQLKVPVPFRKHKKKVAHKIPGWYAGRYPQKAAEILKVLKLKAAEYKGLKAEELIIVHAQAKK'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"protein_dataset = load_dataset('text', data_files='data/protein_1g.txt')\n",
"protein_dataset[\"train\"][0]"
]
},
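{
"cell_type": "markdown",
"id": "ecaa8216-7b9f-4ba0-af8e-c7c868dc7ec9",
"metadata": {},
"source": [
"Protein sequences are text made up of 20 letters, one per amino acid, such as M, L, T, D, and P. Proteins are currently far better understood than DNA.\n",
"\n",
"Then comes the English text, which is much easier to read."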
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7521f8ea-fd70-4f5b-aeeb-7ff81635320d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'text': ' \" There \\'s Got to Be a Way \" is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as the fifth and final single from the album in the United Kingdom . It was one of four songs Carey wrote with Ric Wake during their first recording session together , but \" There \\'s Got to Be a Way \" was the only composition to make the final track listing . It is a socio @-@ political conscious R & B @-@ pop song which addresses the existence of poverty , racism and war in the world which gradually becomes more aspirational and positive as it progresses . The track garnered a mixed reception upon the album \\'s release in 1990 . While Carey \\'s vocals were praised , it was seen as too political . An accompanying music video highlights social injustices . The song reached number 54 on the UK Singles Chart . '}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"english_dataset = load_dataset('text', data_files='data/english_500m.txt')\n",
"english_dataset[\"train\"][301]"
]
},
{
"cell_type": "markdown",
"id": "5fcad08d-6389-453e-997f-eb2877a5fbbb",
"metadata": {},
"source": [
"English text is made up of 26 letters. English also separates words with spaces, whereas biological sequences have no explicit delimiters."
]
},
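{
"cell_type": "markdown",
"id": "5e4e1e85-a187-469d-9950-1c6cbb9c41f7",
"metadata": {},
"source": [
"The datasets above are plain text and are generally used as pretraining data. Downstream fine-tuning tasks such as classification usually include labels and are typically stored in JSON or CSV format. Here is an example:"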
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c48dd04e-af42-4222-94d5-56a8e08e2cbf",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7c611d1ab3bb408394196e7929d8e0c5",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"{'sentence1': 'ATGGAGGAAAATCAGACCATGGTCACAGAGTTCGTCCTGCTGGGATTCTGTCTTGGCCCGAGGATTCACCTAGTTCTTTTTCTGCTTTTCTCTCTCTTCTATACTCTCACCATACTGGGGAATGGGACTATCCTTGCAATGATCTGCCTGGACTCCAGACTCCACACTCCCATGTACTTCTTCCTGTCCCACCTGGCCATTGTCGATATGGCCTATGCCTGCAACACAGTGCCTCAGACACTCATAAACCTCTTGGATGAGACCAGGCCCATCACCTTTGCTGGATGCATGACACAGACCTTTCTCTTCTTGGCTTTTGCCCACACTGAATGTGTGCTCCTTGTTGTGATGTCCTATGACCGGTATGTAGCTATCTGCCACCCGCTACACTACACTGTCATCATGAACTGGAGAGTGTGTACCATTCTGGCTGCTGTTTCCTGGATATTTAGCTTTCTCCTTGCTCTGGTCCATTTAGTTCTCATCCTGAGGCTGCCCTTCTGTGGACCTCATGAAATCAATCACTTCTTCTGTGAAATCCTGTCTGTCCTCAAGCTGGCCTGTGCTGACACAACACTCAATCAGGTCGTTATCTTTGCAGCTTGTGTGTTCATATTAGTGGCCCCCCTATGCTTTGTACTAGTCTCCTACACACGCATCCTGGTGGCCATCCTGAGGATCCAGTCAGGGGAGGGACGCAGAAAGGCCTTCTCTACCTGTTCCTCCCACCTCTGTGTGGTAGGGCTCTTCTTTGGCAGTGCCATTGTCATGTACATGGCCCCCAAGTCCCAGCACCCAGAGGAGCAGCAGAAGGTTCTTTTCCTGTTTTACAGTTTTTTCAACCCCATGCTGAACCCCCTAATCTACAGTCTGAGGAATGCTGAGGTGAAGGGCGCCCTCAAGAGGTCACTGTGCAAAGAAAGTCATTCCTGGTTGGTGTGGTGTTCGGACCATAAATCTTGG',\n",
" 'sentence2': 'MEENQTMVTEFVLLGFCLGPRIHLVLFLLFSLFYTLTILGNGTILAMICLDSRLHTPMYFFLSHLAIVDMAYACNTVPQTLINLLDETRPITFAGCMTQTFLFLAFAHTECVLLVVMSYDRYVAICHPLHYTVIMNWRVCTILAAVSWIFSFLLALVHLVLILRLPFCGPHEINHFFCEILSVLKLACADTTLNQVVIFAACVFILVAPLCFVLVSYTRILVAILRIQSGEGRRKAFSTCSSHLCVVGLFFGSAIVMYMAPKSQHPEEQQKVLFLFYSFFNPMLNPLIYSLRNAEVKGALKRSLCKESHSWLVWCSDHKSW',\n",
" 'label': 1}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ft_dataset = load_dataset('json', data_files='data/dna_protein_my.json')\n",
"ft_dataset[\"train\"][0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f3ec639-e426-4233-a20a-dad94069175b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}