# 2 Gene-Related Pretraining and Fine-Tuning Data

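This tutorial focuses on gene-related biological sequence data, chiefly DNA and protein sequences. The data directory contains the following files: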

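* dna_1g.txt: DNA sequence data, about 1 GB, extracted from the glue dataset (see the DNABERT-2 paper for details); it covers several model organisms
* protein_1g.txt: protein sequence data, about 1 GB, extracted from the PDB database
* english_500m.txt: English text, about 500 MB, taken from English encyclopedia articles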
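
Because these files are large, it can be convenient to look at the first few lines before loading anything. The following is a minimal sketch using only the Python standard library. The helper name `peek_lines` and the small stand-in file are illustrative assumptions, not part of the tutorial's data; it also assumes one sequence per line, which matches how the `text` loader treats each line as one row.

```python
import itertools
import os
import tempfile


def peek_lines(path, n=3, width=60):
    """Return the first n lines of a text file, each cut to width characters."""
    with open(path, encoding="utf-8") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n")[:width] for line in itertools.islice(handle, n)]


# Stand-in file so the sketch runs anywhere (hypothetical content);
# in practice, pass a real path such as data/dna_1g.txt instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "dna_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("TTAAATCCTAGAAGTTGG\nATGGAGGAAAATCAGACC\nGGGTGAGGAAAATGGTGA\n")
    preview = peek_lines(demo_path, n=2, width=10)

print(preview)
```

Truncating each line matters here because a single sequence can run to thousands of characters.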

The cells below demonstrate basic usage of the Hugging Face datasets library and show sample records from each file.

In [3]:
# Load the DNA data
from datasets import load_dataset
dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')
dna_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1079595
    })
})


Datasets provides loading scripts for local and remote datasets. It supports several common data formats, for example:

| Data format       | Loading script | Example                                                                 |
|-------------------|----------------|-------------------------------------------------------------------------|
| CSV & TSV         | csv            | `load_dataset("csv", data_files="my_file.csv")`                         |
| Text files        | text           | `load_dataset("text", data_files="my_file.txt")`                        |
| JSON & JSON Lines | json           | `load_dataset("json", data_files="my_file.jsonl")`                      |
| Pickled DataFrames| pandas         | `load_dataset("pandas", data_files="my_dataframe.pkl")`                |

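As the table shows, for each data format we only need to call load_dataset() with the name of the matching loading script and pass the path to one or more files through the data_files argument.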

By default, load_dataset places the data under the train split, and the returned DatasetDict can be used much like an ordinary Python dict.

In [4]:
dna_dataset["train"][0]

{'text': 'TTAAATCCTAGAAGTTGGTTACACGGGTGAGGAAAATGGTGAGAAGCCCAATGGGATGCTGTAGCAATGACAGTGAACTGCTGTCACCCCTGAGGCTGGAAAGATAACAGACATTTGCCAGGAGCTAGAAGCTGGGGCAGCCTGGTAGGAGCGAGAATATGGTGAGAGCTGCCCCCTGGGGATGGAACCACAGAGGGAGGGTCTCTCTGATGAGACATAGAGCCAAGAACAGATACAGCCATTGTGGGAGATGGTAACCAAAGCAGAGAGAGAGAGAGAGAGCGAGAGAGAGAGAAAACACCCTGGTTTCTTCCTTCCTTCCACCTTTGAGTTTCCCACCAGTGCTTCCCATTAGCCCAAACTACCAAGAACCCAGAGGGCAAAGGAGCCCGGGAAATCTAATTCTACATGATACCGAGCAAAGCCGATGTTCCAGCTGGCTGCGTCTGTTACAGTAGGTAGTCAGGCAGACATAAGCAGGGCAGGAGAGGGCTCCTCCCAACCAGGAATGTCAGGTGACGGTCAGGTGATGGTCAGGTGGTCATTAACTGTCTCTCTAAAATAATAATTGGTTACAGCCAGCACCAGGGAAAGGCAGTCTCCCAACCGATAGAAACATCTGAAACTGATGATCAGTAGCTTCCCAATAAGGTCTCAGGAGTTGGACGCATGGGCTCAGCATGAACACTGAGAGGCAAAATGGTGGAGTTTAACTGGTATATGACCTTCCTCTAGAAACATTCAGCTGGTAAGGGAAGAACGCCTTAAGCGAATATGCACGCAACTCCAGTAAACACTGTGCATGTTCCTGTCCCAATGCTGGTAGACCACTGCGCATGCAAACAGCCCACCCCAGGGAAGAATCAGGAGAGAAGAGACCCCACAAGCATGCCAACACATAAAACCCCAAGTCAGGAGTCAAACCATGCACTTGAATCAAGTCACCCACTTAGCTCTCTTTCAAGTGTATTTTACTTTCTTTCATTCCTG

In [5]:
protein_dataset = load_dataset('text', data_files='data/protein_1g.txt')
protein_dataset["train"][0]

{'text': 'MLTDPFGRTIKLRIAVTRCLCIYCHREGESDPGTEMSAERIAEIAKAFYELGIKKLKLTGGEPLLRKDICEIISMMPDFEEISLTTGILLSDLAFDLKESGLDRVISLDTLDAETFRFITGGGELSRVLEGLRMAVEAKLTPIKLMVLMSGLESEVRKMLEFASFEETVILQLIELIPSRTGKFYLDPTIFEKDFERVAKAVKIRDMHRRKQFITPFGVVEIVKPLDTEFCMHCRIRITSDGRIKLCLMSDETVDISELSGDELKKAIFEAVKRRKPFFIMKGEILALISAVLWGFAPILDRYALLSGAPIYAALAIRAFGALIAMLFILSVLRGGLAVEAKAAVLLLIAGAIGGALAMVFYYLALESVGASRTVPITAIYPMFTALFSFLLLSEPLSPKTIAGIAFIVLGVILVSEGMVKLRGEDVVIRKYDHSMDRDKLIEMYVYDPRFRCLGLPPLSKEAIKGWIDYLGQGFAIIAEKDGKIVGHLVIVPGEREVDLTIFIHQDYQLGLGQEMMKLIIDFCRKAGFAITLVTERTARAIHVYRKLGFEIVAPYYEYDMRLQLKMIVPKGKTVLIKGTASIRGECEVLGARLFFESEKFVPVFCLEDCEIEVGEFKILDGSTIPESWEKLSKMDWETVFLYGGVDSGKSTLATYLAKVGGAYVLDLDIGQADVAPGAMGYGFAKDVVSLSKVSMIGFFVGSITPQGREAKCLRGVARLWKELRKLDGRKIIDTTGWVRGRRAKEYKLAKLEIIEPDLIASFEGKLFDWKTFEVEKGYVIRRDKDRAKARFESYRKFLDGAKTFELERDGIKLKPDFFKGKDVSQFIESVLGTRVVFARLGEEHLTICTKEDCPEYEILRELKELYEVDDIFLFSESEARFVAGLYRGKKYLGIGLIKSIDRILLECTQSDFDTIEIGEIRLEDGRECFIKRFMARIAYSYKPQDETRAARAMGYEVPISFKHAMEICRVLKGKKVPQAISFLEEVVQL

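A protein sequence is text built from 20 letters, one per amino acid (M, L, T, D, P, and so on). Our current understanding of proteins is, of course, far ahead of our understanding of DNA.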
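
To make the 20-letter alphabet concrete, here is a minimal sketch that counts the residues in a short prefix of the sample record above and reports any letter outside the 20 standard one-letter amino acid codes. The helper names `residue_counts` and `non_standard_letters` are illustrative assumptions. Real datasets may also contain ambiguity codes such as X or B, which this check would flag.

```python
from collections import Counter

# The 20 standard amino acids, written as one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def residue_counts(sequence):
    """Count how often each letter occurs in a protein sequence."""
    return Counter(sequence)


def non_standard_letters(sequence):
    """Return the letters that are not one of the 20 standard codes."""
    return set(sequence) - STANDARD_AMINO_ACIDS


# First 30 residues of the sample protein record shown above.
prefix = "MLTDPFGRTIKLRIAVTRCLCIYCHREGES"

counts = residue_counts(prefix)
unexpected = non_standard_letters(prefix)

print(counts.most_common(3))
print(unexpected)
```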

Next comes the English text, which is much easier to read.

In [9]:
english_dataset = load_dataset('text', data_files='data/english_500m.txt')
english_dataset["train"][301]

{'text': ' " There \'s Got to Be a Way " is a song by American singer and songwriter Mariah Carey from her self @-@ titled debut studio album ( 1990 ) . Columbia released it as the fifth and final single from the album in the United Kingdom . It was one of four songs Carey wrote with Ric Wake during their first recording session together , but " There \'s Got to Be a Way " was the only composition to make the final track listing . It is a socio @-@ political conscious R & B @-@ pop song which addresses the existence of poverty , racism and war in the world which gradually becomes more aspirational and positive as it progresses . The track garnered a mixed reception upon the album \'s release in 1990 . While Carey \'s vocals were praised , it was seen as too political . An accompanying music video highlights social injustices . The song reached number 54 on the UK Singles Chart . '}

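English text is built from the 26 letters of the alphabet. Unlike biological sequences, it also contains spaces; a biological sequence has no explicit delimiter between its units.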
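
The practical consequence of missing spaces can be seen with plain whitespace splitting. In this minimal sketch the English sentence breaks into words, while the DNA string comes back as a single unbroken token. The helper name `whitespace_tokens` is an illustrative assumption, and both strings are short excerpts of the sample records above.

```python
def whitespace_tokens(text):
    """Split text on runs of whitespace, as a naive word tokenizer would."""
    return text.split()


english = "It was one of four songs Carey wrote with Ric Wake"
dna = "TTAAATCCTAGAAGTTGGTTACACGGGTGAGG"

english_tokens = whitespace_tokens(english)
dna_tokens = whitespace_tokens(dna)

# English yields one token per word; the DNA string stays in one piece.
print(len(english_tokens), len(dna_tokens))
```

For this reason, a tokenizer for biological sequences cannot rely on whitespace to find unit boundaries.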

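The datasets above are plain text and are generally used as pretraining data. Downstream fine-tuning tasks such as classification usually come with labels and are mostly stored in JSON or CSV format. The example below loads one such JSON file.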
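
As a rough illustration of what a labeled file could look like on disk, the sketch below writes two records as JSON Lines with the standard `json` module and reads them back. The field names `sentence1` and `sentence2` follow the sample record in this section. The `label` field, its 0/1 values, and the file name `pairs_demo.jsonl` are assumptions made for illustration; the actual schema of the tutorial's file may differ.

```python
import json
import os
import tempfile

# Hypothetical records: a DNA fragment, a protein fragment, and an assumed
# binary label (1 = the pair belongs together, 0 = it does not).
records = [
    {"sentence1": "ATGGAGGAAAAT", "sentence2": "MEEN", "label": 1},
    {"sentence1": "TTAAATCCTAGA", "sentence2": "MLTD", "label": 0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pairs_demo.jsonl")

    # JSON Lines: one JSON object per line.
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

    # Read the file back line by line.
    with open(path, encoding="utf-8") as handle:
        loaded = [json.loads(line) for line in handle]

print(loaded[0])
```

A file in this layout is the kind of input the `json` loading script in the table above accepts.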

In [11]:
ft_dataset = load_dataset('json', data_files='data/dna_protein_my.json')
ft_dataset["train"][0]

{'sentence1': 'ATGGAGGAAAATCAGACCATGGTCACAGAGTTCGTCCTGCTGGGATTCTGTCTTGGCCCGAGGATTCACCTAGTTCTTTTTCTGCTTTTCTCTCTCTTCTATACTCTCACCATACTGGGGAATGGGACTATCCTTGCAATGATCTGCCTGGACTCCAGACTCCACACTCCCATGTACTTCTTCCTGTCCCACCTGGCCATTGTCGATATGGCCTATGCCTGCAACACAGTGCCTCAGACACTCATAAACCTCTTGGATGAGACCAGGCCCATCACCTTTGCTGGATGCATGACACAGACCTTTCTCTTCTTGGCTTTTGCCCACACTGAATGTGTGCTCCTTGTTGTGATGTCCTATGACCGGTATGTAGCTATCTGCCACCCGCTACACTACACTGTCATCATGAACTGGAGAGTGTGTACCATTCTGGCTGCTGTTTCCTGGATATTTAGCTTTCTCCTTGCTCTGGTCCATTTAGTTCTCATCCTGAGGCTGCCCTTCTGTGGACCTCATGAAATCAATCACTTCTTCTGTGAAATCCTGTCTGTCCTCAAGCTGGCCTGTGCTGACACAACACTCAATCAGGTCGTTATCTTTGCAGCTTGTGTGTTCATATTAGTGGCCCCCCTATGCTTTGTACTAGTCTCCTACACACGCATCCTGGTGGCCATCCTGAGGATCCAGTCAGGGGAGGGACGCAGAAAGGCCTTCTCTACCTGTTCCTCCCACCTCTGTGTGGTAGGGCTCTTCTTTGGCAGTGCCATTGTCATGTACATGGCCCCCAAGTCCCAGCACCCAGAGGAGCAGCAGAAGGTTCTTTTCCTGTTTTACAGTTTTTTCAACCCCATGCTGAACCCCCTAATCTACAGTCTGAGGAATGCTGAGGTGAAGGGCGCCCTCAAGAGGTCACTGTGCAAAGAAAGTCATTCCTGGTTGGTGTGGTGTTCGGACCATAAATCTTGG',
 'sentence2': 'MEEN