{ "cells": [ { "cell_type": "markdown", "id": "68c06a52-e27c-4da6-8a02-cd010270bedf", "metadata": {}, "source": [ "# 3 datasets库基本使用" ] }, { "cell_type": "markdown", "id": "2dc4c70f-694c-4785-81d8-26ebab2b7210", "metadata": {}, "source": [ "## 基本使用\n", "上一节中,已经介绍了使用datasets读取本地文件的方法,这一节继续介绍datasets一些常用的方法\n", "\n", "首先是数据分割,因为我们从数据源获得DNA序列等数据,都是一个文本文件,但训练的时候,一般都需要分成训练集和测试集等\n", "\n", "一个简单的例子如下所示:" ] }, { "cell_type": "code", "execution_count": 1, "id": "6e9f346f-31f6-40cc-86e5-723c65033883", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['text'],\n", " num_rows: 1025615\n", " })\n", " test: Dataset({\n", " features: ['text'],\n", " num_rows: 53980\n", " })\n", "})" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#读取dna数据\n", "from datasets import load_dataset\n", "dna_dataset = load_dataset('text', data_files='data/dna_1g.txt')['train'].rain_test_split(test_size=0.05) #默认已经shuffle\n", "dna_dataset" ] }, { "cell_type": "code", "execution_count": 2, "id": "75900650-74da-4ca9-a285-b2832a5a1485", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'text': 'ATGTGTGCAATGGGTTATCTTTATGTAATAACAGTCATATCACGGGTGTTCCTCAGAAGTAGTGAACTGGCTAGCATTTTTAGACACTATGTGATCTCTCATATGACTACACTCAATTTAAAATAAAATGAAATGTGTTGTGTGTGTCTAAAATCTATAAAGGGAAAAGTATCTTAAGTATTTTTTAGATGTTAAAGTAGATGTGTATCCTAAAATATGCATTGTTCACAGATGTTAAAATTACAACTACAATCTGTGAAACACAGATCTTAGGACAGCAATGTTTCACAAGAAAAAAAATGATGCAGCCTTCTTTAGTATTTATAGTCATTTGAACAATTATGGCAACCATAAGTTCATATATAACATCCCCATTTGGTGAAACTAGTTGGGAAAGATTAGAAGGTATGACCTTGTTGGAGGAACTATACCATTGGGGTGGCTTTGAGACTTCAGAAGTTTCAAGGCCCATTTAGTGCTTTCTACCTTATGAAGCTGTGAGTTCTCCTTGCTAGCTACATAACTTGGAAAGCAGGCCCTGCACTTCACCCAAGGAGCACATTAGAGCTGGCCCTTTTGGAAGGCAATTGCGTAAGCCACACCAGGGCACCAGAGATCTGGCACTGCCATGCTCCTGCTTGCAAGTAGTGGTGTGGGTGTTGGGTGATGCCCTCCAGTCCCACCTTTTGCCACCTGTAGTAGTCAGGGGAGTTGGCCTAAGGGCATGAGAGCCTAAGACTTCACCCTAATCCCTCACCAACTGTAGCATGTGGAAGAGCAGGCTCTGTACCTTCCCTGGGCAACACATTGGAGCTGGCCCCTCACAGGCTGCAGGACTTGGGAGAGTGAGTGCTGCACCTTGACTGTGAAGGTGGTTTTGGAGGTGTGGGTGTGAGACCATGAGACCAAGAGAGGAATGGAATATTACTCACTTATTAAAAACAATGACTTCATGAAATTTGCAGGCAAATGGATGGAACTTGAAAATATCCTGAGTGAG'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dna_dataset[\"test\"][0]" ] }, { "cell_type": "markdown", "id": "cdcc5404-6590-47a4-be2c-2c1d35d3bae4", "metadata": {}, "source": [ "可以看到,数据集已经分割成了train和test两个数据集,而在分割的时候,已经进行的随机处理\n", "\n", "当然,如果数据集过大,我们只需要其中一部分,这个也是一个常见的需求,一般可以使用 Dataset.select() 函数" ] }, { "cell_type": "code", "execution_count": 4, "id": "049ad194-cb60-4b0f-8554-1915bfc7a9cd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['text'],\n", " num_rows: 50000\n", " })\n", " valid: Dataset({\n", " features: ['text'],\n", " num_rows: 500\n", " })\n", "})" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from datasets import load_dataset, DatasetDict\n", "\n", "dna_dataset_sample = DatasetDict(\n", " {\n", " \"train\": dna_dataset[\"train\"].shuffle().select(range(50000)), \n", " \"valid\": dna_dataset[\"test\"].shuffle().select(range(500)),\n", " \"evla\": dna_dataset[\"test\"].shuffle().select(range(500))\n", "\n", " }\n", ")\n", "dna_dataset_sample" ] }, { "cell_type": "markdown", "id": "50cceda3-36ca-4fa6-bfb5-1dbeb155fe4f", "metadata": {}, "source": [ "可以看到,我们使用DatasetDict来直接构造datasets,先使用shuffle()来随机,然后使用select来选择前n个数据\n", "\n", "select的参数为indices (list 或 range): 
{ "attachments": {}, "cell_type": "markdown", "id": "17a1fa7c-ff4b-419f-8a82-e58cc5777cd4", "metadata": {}, "source": [ "## Reading datasets from the Hub\n", "\n", "Data can also be loaded directly from the Hugging Face Hub; depending on your network environment you may need a proxy or a mirror to reach it.\n", "\n", "The function to use is again load_dataset." ] },
{ "cell_type": "code", "execution_count": 9, "id": "6ae24950-2c74-457b-b1f2-d2e4397e1fa1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"\\nimport os\\n\\n# Set the HF mirror endpoint (AutoDL dedicated zone / other IDCs)\\nos.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\\n\\n# Print the environment variable to confirm it was set\\nprint(os.environ.get('HF_ENDPOINT'))\\n\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import subprocess\n", "import os\n", "# Set the proxy environment variables (AutoDL standard zone)\n", "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n", "output = result.stdout\n", "for line in output.splitlines():\n", "    if '=' in line:\n", "        var, value = line.split('=', 1)\n", "        os.environ[var] = value\n", "\n", "# Alternatively:\n", "\"\"\"\n", "import os\n", "\n", "# Set the HF mirror endpoint (AutoDL dedicated zone / other IDCs)\n", "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n", "\n", "# Print the environment variable to confirm it was set\n", "print(os.environ.get('HF_ENDPOINT'))\n", "\"\"\"" ] },
{ "cell_type": "code", "execution_count": 10, "id": "30ff9798-d06d-4992-81fc-03102f03599b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", "    train: Dataset({\n", "        features: ['sequence', 'label'],\n", "        num_rows: 59196\n", "    })\n", "})" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from datasets import load_dataset\n", "dna_data = load_dataset(\"dnagpt/dna_core_promoter\")\n", "dna_data" ] },
{ "cell_type": "markdown", "id": "30c4b754-af11-4ac1-9742-45427059617e", "metadata": {}, "source": [ "And if you want to share your own dataset on the Hugging Face Hub, that is also a single function call:" ] },
{ "cell_type": "code", "execution_count": null, "id": "f9847be9-e085-41e3-ad29-a450cc017d64", "metadata": {}, "outputs": [], "source": [ "dna_data.push_to_hub(\"org_name/your_dataset_name\", token=\"hf_yourtoken\")" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }