{
"cells": [
{
"cell_type": "markdown",
"id": "50ff8836-7075-4858-b463-c99f973f408d",
"metadata": {},
"source": [
"# 2 Gene-Related Pretraining and Fine-Tuning Data"
]
},
{
"cell_type": "markdown",
"id": "17cde5bb-70e5-437e-a4a3-193a881dd412",
"metadata": {},
"source": [
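"This tutorial focuses on gene-related biological sequence data, mainly DNA and protein sequences. The `data` directory contains the following files:\n",
"\n",
"* dna_1g.txt: DNA sequence data, 1 GB, sampled from the glue dataset (see the DNABERT-2 paper for details); it covers several model organisms\n",
"* protein_1g.txt: protein sequence data, 1 GB, sampled from the PDB database\n",
"* english_500m.txt: English text, 500 MB, taken from English encyclopedia articles"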
]
},
{
"cell_type": "markdown",
"id": "b45ecf27-1514-45e0-bfbd-361e6dcc98ea",
"metadata": {},
"source": [
"The cells below demonstrate basic usage of the Hugging Face `datasets` library and show a sample record from each file."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2715f9bb-2e43-4bd6-8715-5c96d317bcf8",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c067aeb8ab304723ac6b527e7ad6c768",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"DatasetDict({\n",
" train: Dataset({\n",
" features: ['text'],\n",
" num_rows: 1079595\n",
" })\n",
"})"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the DNA data as a plain-text dataset (one record per line)\n",
"from datasets import load_dataset\n",
"dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')\n",
"dna_dataset"
]
},
{
"cell_type": "markdown",
"id": "ec00ad72-c5f9-40db-8508-6c6bf8f374c1",
"metadata": {},
"source": [
"\n",
"Datasets provides loading scripts for both local and remote datasets. It supports several common data formats, for example:\n",
"\n",
"| Data format | Loading script | Example |\n",
"|-------------------|----------------|-------------------------------------------------------------------------|\n",
"| CSV & TSV | csv | `load_dataset(\"csv\", data_files=\"my_file.csv\")` |\n",
"| Text files | text | `load_dataset(\"text\", data_files=\"my_file.txt\")` |\n",
"| JSON & JSON Lines | json | `load_dataset(\"json\", data_files=\"my_file.jsonl\")` |\n",
"| Pickled DataFrames| pandas | `load_dataset(\"pandas\", data_files=\"my_dataframe.pkl\")` |\n",
"\n",
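"As the table shows, for each data format we only need to call `load_dataset()` with the matching loading script and pass the path to one or more files through the `data_files` argument."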
]
},
{
"cell_type": "markdown",
"id": "24c40ec7-cb59-4c3a-8052-00d7979f6208",
"metadata": {},
"source": [
"By default, `load_dataset` places the data in the `train` split. The returned `DatasetDict` can be used much like an ordinary Python dict."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2a375409-d2b6-4648-8f6a-8ac3fb25bb75",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'text': 'TTAAATCCTAGAAGTTGGTTACACGGGTGAGGAAAATGGTGAGAAGCCCAATGGGATGCTGTAGCAATGACAGTGAACTGCTGTCACCCCTGAGGCTGGAAAGATAACAGACATTTGCCAGGAGCTAGAAGCTGGGGCAGCCTGGTAGGAGCGAGAATATGGTGAGAGCTGCCCCCTGGGGATGGAACCACAGAGGGAGGGTCTCTCTGATGAGACATAGAGCCAAGAACAGATACAGCCATTGTGGGAGATGGTAACCAAAGCAGAGAGAGAGAGAGAGAGCGAGAGAGAGAGAAAACACCCTGGTTTCTTCCTTCCTTCCACCTTTGAGTTTCCCACCAGTGCTTCCCATTAGCCCAAACTACCAAGAACCCAGAGGGCAAAGGAGCCCGGGAAATCTAATTCTACATGATACCGAGCAAAGCCGATGTTCCAGCTGGCTGCGTCTGTTACAGTAGGTAGTCAGGCAGACATAAGCAGGGCAGGAGAGGGCTCCTCCCAACCAGGAATGTCAGGTGACGGTCAGGTGATGGTCAGGTGGTCATTAACTGTCTCTCTAAAATAATAATTGGTTACAGCCAGCACCAGGGAAAGGCAGTCTCCCAACCGATAGAAACATCTGAAACTGATGATCAGTAGCTTCCCAATAAGGTCTCAGGAGTTGGACGCATGGGCTCAGCATGAACACTGAGAGGCAAAATGGTGGAGTTTAACTGGTATATGACCTTCCTCTAGAAACATTCAGCTGGTAAGGGAAGAACGCCTTAAGCGAATATGCACGCAACTCCAGTAAACACTGTGCATGTTCCTGTCCCAATGCTGGTAGACCACTGCGCATGCAAACAGCCCACCCCAGGGAAGAATCAGGAGAGAAGAGACCCCACAAGCATGCCAACACATAAAACCCCAAGTCAGGAGTCAAACCATGCACTTGAATCAAGTCACCCACTTAGCTCTCTTTCAAGTGTATTTTACTTTCTTTCATTCCTGCTCTAAAACT'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dna_dataset[\"train\"][0]"
]
},
{
"cell_type": "markdown",
"id": "985bd82a-1ff0-49ef-968d-8d5f6df8d76f",
"metadata": {},
"source": [
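"As shown above, DNA data is text made up of the four letters A, T, C, and G. For the purpose of learning about large language models, its specific meaning can be ignored; in any case, the meaning of most DNA sequences has not been deciphered yet :)\n",
"\n",
"Next is the protein sequence data."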
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "94e3f443-939e-4148-bba6-6cafa90790b6",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a1023bd5311a4a5dbe96c6c8fdc5b519",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"{'text': 'MLTDPFGRTIKLRIAVTRCLCIYCHREGESDPGTEMSAERIAEIAKAFYELGIKKLKLTGGEPLLRKDICEIISMMPDFEEISLTTGILLSDLAFDLKESGLDRVISLDTLDAETFRFITGGGELSRVLEGLRMAVEAKLTPIKLMVLMSGLESEVRKMLEFASFEETVILQLIELIPSRTGKFYLDPTIFEKDFERVAKAVKIRDMHRRKQFITPFGVVEIVKPLDTEFCMHCRIRITSDGRIKLCLMSDETVDISELSGDELKKAIFEAVKRRKPFFIMKGEILALISAVLWGFAPILDRYALLSGAPIYAALAIRAFGALIAMLFILSVLRGGLAVEAKAAVLLLIAGAIGGALAMVFYYLALESVGASRTVPITAIYPMFTALFSFLLLSEPLSPKTIAGIAFIVLGVILVSEGMVKLRGEDVVIRKYDHSMDRDKLIEMYVYDPRFRCLGLPPLSKEAIKGWIDYLGQGFAIIAEKDGKIVGHLVIVPGEREVDLTIFIHQDYQLGLGQEMMKLIIDFCRKAGFAITLVTERTARAIHVYRKLGFEIVAPYYEYDMRLQLKMIVPKGKTVLIKGTASIRGECEVLGARLFFESEKFVPVFCLEDCEIEVGEFKILDGSTIPESWEKLSKMDWETVFLYGGVDSGKSTLATYLAKVGGAYVLDLDIGQADVAPGAMGYGFAKDVVSLSKVSMIGFFVGSITPQGREAKCLRGVARLWKELRKLDGRKIIDTTGWVRGRRAKEYKLAKLEIIEPDLIASFEGKLFDWKTFEVEKGYVIRRDKDRAKARFESYRKFLDGAKTFELERDGIKLKPDFFKGKDVSQFIESVLGTRVVFARLGEEHLTICTKEDCPEYEILRELKELYEVDDIFLFSESEARFVAGLYRGKKYLGIGLIKSIDRILLECTQSDFDTIEIGEIRLEDGRECFIKRFMARIAYSYKPQDETRAARAMGYEVPISFKHAMEICRVLKGKKVPQAISFLEEVVQLKVPVPFRKHKKKVAHKIPGWYAGRYPQKAAEILKVLKLKAAEYKGLKAEELIIVHAQAKK'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"protein_dataset = load_dataset('text', data_files='data/protein_1g.txt')\n",
"protein_dataset[\"train\"][0]"
]
},
{
"cell_type": "markdown",
"id": "ecaa8216-7b9f-4ba0-af8e-c7c868dc7ec9",
"metadata": {},
"source": [
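"Protein sequences are text made up of 20 letters, one per amino acid, such as M, L, T, D, and P. Proteins are currently far better understood than DNA.\n",
"\n",
"Last is the English text, which is much easier to read."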
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7521f8ea-fd70-4f5b-aeeb-7ff81635320d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'text': ' \" There \\'s Got to Be a Way \" is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as the fifth and final single from the album in the United Kingdom . It was one of four songs Carey wrote with Ric Wake during their first recording session together , but \" There \\'s Got to Be a Way \" was the only composition to make the final track listing . It is a socio @-@ political conscious R & B @-@ pop song which addresses the existence of poverty , racism and war in the world which gradually becomes more aspirational and positive as it progresses . The track garnered a mixed reception upon the album \\'s release in 1990 . While Carey \\'s vocals were praised , it was seen as too political . An accompanying music video highlights social injustices . The song reached number 54 on the UK Singles Chart . '}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"english_dataset = load_dataset('text', data_files='data/english_500m.txt')\n",
"english_dataset[\"train\"][301]"
]
},
{
"cell_type": "markdown",
"id": "5fcad08d-6389-453e-997f-eb2877a5fbbb",
"metadata": {},
"source": [
"English text is made up of the 26 letters of the alphabet. It also contains spaces, whereas biological sequences have no explicit separators."
]
},
{
"cell_type": "markdown",
"id": "5e4e1e85-a187-469d-9950-1c6cbb9c41f7",
"metadata": {},
"source": [
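"The datasets above are plain text and are typically used as pretraining data. Downstream fine-tuning tasks such as classification usually include labels and are mostly stored in JSON or CSV format. Here is one example:"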
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c48dd04e-af42-4222-94d5-56a8e08e2cbf",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7c611d1ab3bb408394196e7929d8e0c5",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"{'sentence1': 'ATGGAGGAAAATCAGACCATGGTCACAGAGTTCGTCCTGCTGGGATTCTGTCTTGGCCCGAGGATTCACCTAGTTCTTTTTCTGCTTTTCTCTCTCTTCTATACTCTCACCATACTGGGGAATGGGACTATCCTTGCAATGATCTGCCTGGACTCCAGACTCCACACTCCCATGTACTTCTTCCTGTCCCACCTGGCCATTGTCGATATGGCCTATGCCTGCAACACAGTGCCTCAGACACTCATAAACCTCTTGGATGAGACCAGGCCCATCACCTTTGCTGGATGCATGACACAGACCTTTCTCTTCTTGGCTTTTGCCCACACTGAATGTGTGCTCCTTGTTGTGATGTCCTATGACCGGTATGTAGCTATCTGCCACCCGCTACACTACACTGTCATCATGAACTGGAGAGTGTGTACCATTCTGGCTGCTGTTTCCTGGATATTTAGCTTTCTCCTTGCTCTGGTCCATTTAGTTCTCATCCTGAGGCTGCCCTTCTGTGGACCTCATGAAATCAATCACTTCTTCTGTGAAATCCTGTCTGTCCTCAAGCTGGCCTGTGCTGACACAACACTCAATCAGGTCGTTATCTTTGCAGCTTGTGTGTTCATATTAGTGGCCCCCCTATGCTTTGTACTAGTCTCCTACACACGCATCCTGGTGGCCATCCTGAGGATCCAGTCAGGGGAGGGACGCAGAAAGGCCTTCTCTACCTGTTCCTCCCACCTCTGTGTGGTAGGGCTCTTCTTTGGCAGTGCCATTGTCATGTACATGGCCCCCAAGTCCCAGCACCCAGAGGAGCAGCAGAAGGTTCTTTTCCTGTTTTACAGTTTTTTCAACCCCATGCTGAACCCCCTAATCTACAGTCTGAGGAATGCTGAGGTGAAGGGCGCCCTCAAGAGGTCACTGTGCAAAGAAAGTCATTCCTGGTTGGTGTGGTGTTCGGACCATAAATCTTGG',\n",
" 'sentence2': 'MEENQTMVTEFVLLGFCLGPRIHLVLFLLFSLFYTLTILGNGTILAMICLDSRLHTPMYFFLSHLAIVDMAYACNTVPQTLINLLDETRPITFAGCMTQTFLFLAFAHTECVLLVVMSYDRYVAICHPLHYTVIMNWRVCTILAAVSWIFSFLLALVHLVLILRLPFCGPHEINHFFCEILSVLKLACADTTLNQVVIFAACVFILVAPLCFVLVSYTRILVAILRIQSGEGRRKAFSTCSSHLCVVGLFFGSAIVMYMAPKSQHPEEQQKVLFLFYSFFNPMLNPLIYSLRNAEVKGALKRSLCKESHSWLVWCSDHKSW',\n",
" 'label': 1}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ft_dataset = load_dataset('json', data_files='data/dna_protein_my.json')\n",
"ft_dataset[\"train\"][0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f3ec639-e426-4233-a20a-dad94069175b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}