{ "cells": [ { "cell_type": "markdown", "id": "50ff8836-7075-4858-b463-c99f973f408d", "metadata": {}, "source": [ "# 2 基因相关预训练和微调数据" ] }, { "cell_type": "markdown", "id": "17cde5bb-70e5-437e-a4a3-193a881dd412", "metadata": {}, "source": [ "本教程主要关注基因相关的生物序列数据,包括主要的DNA和蛋白质序列,data目录下数据如下:\n", "\n", "* dna_1g.txt DNA序列数据,大小1G,从glue数据集中抽取,具体可参考dnabert2的论文,包括多个模式生物的数据\n", "* potein_1g.txt 蛋白质序列数据,大小1G,从pdb数据库中抽取\n", "* english_500m.txt 英文数据,大小500M,就是英文百科" ] }, { "cell_type": "markdown", "id": "b45ecf27-1514-45e0-bfbd-361e6dcc98ea", "metadata": {}, "source": [ "下面演示下huggingface的dataset库的基本用法,以及样例数据" ] }, { "cell_type": "code", "execution_count": 3, "id": "2715f9bb-2e43-4bd6-8715-5c96d317bcf8", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c067aeb8ab304723ac6b527e7ad6c768", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0 examples [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['text'],\n", " num_rows: 1079595\n", " })\n", "})" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#读取dna数据\n", "from datasets import load_dataset\n", "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')\n", "dna_dataset" ] }, { "cell_type": "markdown", "id": "ec00ad72-c5f9-40db-8508-6c6bf8f374c1", "metadata": {}, "source": [ "\n", "Datasets 提供了加载脚本来加载本地和远程数据集。它支持几种常见的数据格式,例如:\n", "\n", "| Data format | Loading script | Example |\n", "|-------------------|----------------|-------------------------------------------------------------------------|\n", "| CSV & TSV | csv | `load_dataset(\"csv\", data_files=\"my_file.csv\")` |\n", "| Text files | text | `load_dataset(\"text\", data_files=\"my_file.txt\")` |\n", "| JSON & JSON Lines | json | `load_dataset(\"json\", data_files=\"my_file.jsonl\")` |\n", "| Pickled DataFrames| pandas | `load_dataset(\"pandas\", data_files=\"my_dataframe.pkl\")` |\n", "\n", "如表所示, 对于每种数据格式, 我们只需要使用 load_dataset() 函数, 使用 data_files 指定一个或多个文件的路径的参数。 " ] }, { "cell_type": "markdown", "id": "24c40ec7-cb59-4c3a-8052-00d7979f6208", "metadata": {}, "source": [ "load_dataset默认加载到train下,可以把dataset当做一个一般的python dict使用" ] }, { "cell_type": "code", "execution_count": 4, "id": "2a375409-d2b6-4648-8f6a-8ac3fb25bb75", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'text': 'TTAAATCCTAGAAGTTGGTTACACGGGTGAGGAAAATGGTGAGAAGCCCAATGGGATGCTGTAGCAATGACAGTGAACTGCTGTCACCCCTGAGGCTGGAAAGATAACAGACATTTGCCAGGAGCTAGAAGCTGGGGCAGCCTGGTAGGAGCGAGAATATGGTGAGAGCTGCCCCCTGGGGATGGAACCACAGAGGGAGGGTCTCTCTGATGAGACATAGAGCCAAGAACAGATACAGCCATTGTGGGAGATGGTAACCAAAGCAGAGAGAGAGAGAGAGAGCGAGAGAGAGAGAAAACACCCTGGTTTCTTCCTTCCTTCCACCTTTGAGTTTCCCACCAGTGCTTCCCATTAGCCCAAACTACCAAGAACCCAGAGGGCAAAGGAGCCCGGGAAATCTAATTCTACATGATACCGAGCAAAGCCGATGTTCCAGCTGGCTGCGTCTGTTACAGTAGGTAGTCAGGCAGACATAAGCAGGGCAGGAGAGGGCTCCTCCCAACCAGGAATGTCAGGTGACGGTCAGGTGATGGTCAGGTGGTCATTAACTGTCTCTCTAAAATAATAATTGGTTACAGCCAGCACCAGGGAAAGGCAGTCTCCCAACCGATAGAAACATCTGAAACTGATGATCAGTAGCTTCCCAATAAGGTCTCAGGAGTTGGACGCATGGGCTCAGCATGAACACTGAGAGGCAAAATGGTGGAGTTTAACTGGTATATGACCTTCCTCTAGAAACATTCAGCTGGTAAGGGAAGAACGCCTTAAGCGAATATGCACGCAACTCCAGTAAACACTGTGCATGTTCCTGTCCCAATGCTGGTAGACCACTGCGCATGCAAACAGCCCACCCCAGGGAAGAATCAGGAGAGAAGAGACCCCACAAGCATGCCAACACATAAAACCCCAAGTCAGGAGTCAAACCATGCACTTGAATCAAGTCACCCACTTAGCTCTCTTTCAAGTGTATTTTACTTTCTTTCATTCCTGCTCTAAAACT'}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dna_dataset[\"train\"][0]" ] }, { "cell_type": "raw", "id": "985bd82a-1ff0-49ef-968d-8d5f6df8d76f", "metadata": {}, "source": [ "dna数据就是如上所示,由ATCG 4个字母组成的文本,对于学习大语言模型而言,可以不关注其具体的含义,当然,大部分dna序列的含义目前也都没有解读:)\n", "\n", "然后是蛋白质序列" ] }, { "cell_type": "code", "execution_count": 5, "id": "94e3f443-939e-4148-bba6-6cafa90790b6", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a1023bd5311a4a5dbe96c6c8fdc5b519", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0 examples [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "{'text': 'MLTDPFGRTIKLRIAVTRCLCIYCHREGESDPGTEMSAERIAEIAKAFYELGIKKLKLTGGEPLLRKDICEIISMMPDFEEISLTTGILLSDLAFDLKESGLDRVISLDTLDAETFRFITGGGELSRVLEGLRMAVEAKLTPIKLMVLMSGLESEVRKMLEFASFEETVILQLIELIPSRTGKFYLDPTIFEKDFERVAKAVKIRDMHRRKQFITPFGVVEIVKPLDTEFCMHCRIRITSDGRIKLCLMSDETVDISELSGDELKKAIFEAVKRRKPFFIMKGEILALISAVLWGFAPILDRYALLSGAPIYAALAIRAFGALIAMLFILSVLRGGLAVEAKAAVLLLIAGAIGGALAMVFYYLALESVGASRTVPITAIYPMFTALFSFLLLSEPLSPKTIAGIAFIVLGVILVSEGMVKLRGEDVVIRKYDHSMDRDKLIEMYVYDPRFRCLGLPPLSKEAIKGWIDYLGQGFAIIAEKDGKIVGHLVIVPGEREVDLTIFIHQDYQLGLGQEMMKLIIDFCRKAGFAITLVTERTARAIHVYRKLGFEIVAPYYEYDMRLQLKMIVPKGKTVLIKGTASIRGECEVLGARLFFESEKFVPVFCLEDCEIEVGEFKILDGSTIPESWEKLSKMDWETVFLYGGVDSGKSTLATYLAKVGGAYVLDLDIGQADVAPGAMGYGFAKDVVSLSKVSMIGFFVGSITPQGREAKCLRGVARLWKELRKLDGRKIIDTTGWVRGRRAKEYKLAKLEIIEPDLIASFEGKLFDWKTFEVEKGYVIRRDKDRAKARFESYRKFLDGAKTFELERDGIKLKPDFFKGKDVSQFIESVLGTRVVFARLGEEHLTICTKEDCPEYEILRELKELYEVDDIFLFSESEARFVAGLYRGKKYLGIGLIKSIDRILLECTQSDFDTIEIGEIRLEDGRECFIKRFMARIAYSYKPQDETRAARAMGYEVPISFKHAMEICRVLKGKKVPQAISFLEEVVQLKVPVPFRKHKKKVAHKIPGWYAGRYPQKAAEILKVLKLKAAEYKGLKAEELIIVHAQAKK'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "protein_dataset = load_dataset('text', data_files='data/protein_1g.txt')\n", "protein_dataset[\"train\"][0]" ] }, { "cell_type": "markdown", "id": "ecaa8216-7b9f-4ba0-af8e-c7c868dc7ec9", "metadata": {}, "source": [ "蛋白质序列,则是有MLTDP等20个字母/氨基酸 组成的文本,当然,目前对蛋白质的理解远超过对DNA的。\n", "\n", "然后就是英文文本了,这个就比较容易看懂" ] }, { "cell_type": "code", "execution_count": 9, "id": "7521f8ea-fd70-4f5b-aeeb-7ff81635320d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'text': ' \" There \\'s Got to Be a Way \" is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as the fifth and final single from the album in the United Kingdom . It was one of four songs Carey wrote with Ric Wake during their first recording session together , but \" There \\'s Got to Be a Way \" was the only composition to make the final track listing . It is a socio @-@ political conscious R & B @-@ pop song which addresses the existence of poverty , racism and war in the world which gradually becomes more aspirational and positive as it progresses . The track garnered a mixed reception upon the album \\'s release in 1990 . While Carey \\'s vocals were praised , it was seen as too political . An accompanying music video highlights social injustices . The song reached number 54 on the UK Singles Chart . '}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "english_dataset = load_dataset('text', data_files='data/english_500m.txt')\n", "english_dataset[\"train\"][301]" ] }, { "cell_type": "markdown", "id": "5fcad08d-6389-453e-997f-eb2877a5fbbb", "metadata": {}, "source": [ "英文序列,就是26个字母组成的文本了,当然,英文是包括空格的,生物序列则没有明确的空格" ] }, { "cell_type": "markdown", "id": "5e4e1e85-a187-469d-9950-1c6cbb9c41f7", "metadata": {}, "source": [ "前面这些数据集,就是常规的文本,一般就是当做预训练数据使用,而分类等下游微调任务,一般都是包含标签的,多写成json或者csv的格式,这里也给出一个例子:" ] }, { "cell_type": "code", "execution_count": 11, "id": "c48dd04e-af42-4222-94d5-56a8e08e2cbf", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7c611d1ab3bb408394196e7929d8e0c5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0 examples [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "{'sentence1': 'ATGGAGGAAAATCAGACCATGGTCACAGAGTTCGTCCTGCTGGGATTCTGTCTTGGCCCGAGGATTCACCTAGTTCTTTTTCTGCTTTTCTCTCTCTTCTATACTCTCACCATACTGGGGAATGGGACTATCCTTGCAATGATCTGCCTGGACTCCAGACTCCACACTCCCATGTACTTCTTCCTGTCCCACCTGGCCATTGTCGATATGGCCTATGCCTGCAACACAGTGCCTCAGACACTCATAAACCTCTTGGATGAGACCAGGCCCATCACCTTTGCTGGATGCATGACACAGACCTTTCTCTTCTTGGCTTTTGCCCACACTGAATGTGTGCTCCTTGTTGTGATGTCCTATGACCGGTATGTAGCTATCTGCCACCCGCTACACTACACTGTCATCATGAACTGGAGAGTGTGTACCATTCTGGCTGCTGTTTCCTGGATATTTAGCTTTCTCCTTGCTCTGGTCCATTTAGTTCTCATCCTGAGGCTGCCCTTCTGTGGACCTCATGAAATCAATCACTTCTTCTGTGAAATCCTGTCTGTCCTCAAGCTGGCCTGTGCTGACACAACACTCAATCAGGTCGTTATCTTTGCAGCTTGTGTGTTCATATTAGTGGCCCCCCTATGCTTTGTACTAGTCTCCTACACACGCATCCTGGTGGCCATCCTGAGGATCCAGTCAGGGGAGGGACGCAGAAAGGCCTTCTCTACCTGTTCCTCCCACCTCTGTGTGGTAGGGCTCTTCTTTGGCAGTGCCATTGTCATGTACATGGCCCCCAAGTCCCAGCACCCAGAGGAGCAGCAGAAGGTTCTTTTCCTGTTTTACAGTTTTTTCAACCCCATGCTGAACCCCCTAATCTACAGTCTGAGGAATGCTGAGGTGAAGGGCGCCCTCAAGAGGTCACTGTGCAAAGAAAGTCATTCCTGGTTGGTGTGGTGTTCGGACCATAAATCTTGG',\n", " 'sentence2': 'MEENQTMVTEFVLLGFCLGPRIHLVLFLLFSLFYTLTILGNGTILAMICLDSRLHTPMYFFLSHLAIVDMAYACNTVPQTLINLLDETRPITFAGCMTQTFLFLAFAHTECVLLVVMSYDRYVAICHPLHYTVIMNWRVCTILAAVSWIFSFLLALVHLVLILRLPFCGPHEINHFFCEILSVLKLACADTTLNQVVIFAACVFILVAPLCFVLVSYTRILVAILRIQSGEGRRKAFSTCSSHLCVVGLFFGSAIVMYMAPKSQHPEEQQKVLFLFYSFFNPMLNPLIYSLRNAEVKGALKRSLCKESHSWLVWCSDHKSW',\n", " 'label': 1}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ft_dataset = load_dataset('json', data_files='data/dna_protein_my.json')\n", "ft_dataset[\"train\"][0]" ] }, { "cell_type": "code", "execution_count": null, "id": "8f3ec639-e426-4233-a20a-dad94069175b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }