Dataset Prepare

HuggingFace datasets

For datasets available on the HuggingFace Hub, such as alpaca, you can use them directly. For more details, please refer to single_turn_conversation.md and multi_turn_conversation.md.

Others

Arxiv Gentitle

The Arxiv dataset is not released on the HuggingFace Hub, but you can download it from Kaggle.

Step 0, download raw data from https://kaggle.com/datasets/Cornell-University/arxiv.
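
If you prefer the command line, here is a minimal download sketch using the Kaggle CLI (an assumption on your setup: the CLI is installed and your API token is placed at ~/.kaggle/kaggle.json):

pip install kaggle
kaggle datasets download -d Cornell-University/arxiv
unzip arxiv.zip   # the extracted metadata JSON is the ${DOWNLOADED_DATA} used in Step 1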

Step 1, process the data with xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} [optional arguments].

For example, to get all cs.AI, cs.CL, and cs.CV papers from 2020-01-01 onward:

xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} --categories cs.AI cs.CL cs.CV --start-date 2020-01-01

Step 2, all Arxiv Gentitle configs assume the dataset path to be ./data/arxiv_data.json. You can move and rename your data, or make changes to these configs.
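
For example, a minimal sketch that matches the expected path (${SAVE_DATA_PATH} is whatever you passed in Step 1):

mkdir -p ./data
mv ${SAVE_DATA_PATH} ./data/arxiv_data.json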

MOSS-003-SFT

The MOSS-003-SFT dataset can be downloaded from https://huggingface.co/datasets/fnlp/moss-003-sft-data.

Step 0, download data.

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/fnlp/moss-003-sft-data

Step 1, unzip.

cd moss-003-sft-data
unzip moss-003-sft-no-tools.jsonl.zip
unzip moss-003-sft-with-tools-no-text2image.zip

Step 2, all moss-003-sft configs assume the dataset paths to be ./data/moss-003-sft-no-tools.jsonl and ./data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl. You can move and rename your data, or make changes to these configs.
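
For example, a sketch that matches the expected paths (run from the directory that contains moss-003-sft-data; adjust the source paths if the archives unzip into subfolders):

mkdir -p ./data
mv moss-003-sft-data/moss-003-sft-no-tools.jsonl ./data/
mv moss-003-sft-data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl ./data/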

Chinese Lawyer

The Chinese Lawyer dataset has two sub-datasets, and can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All lawyer configs assume the dataset paths to be ./data/CrimeKgAssitant清洗后_52k.json and ./data/训练数据_带法律依据_92k.json. You can move and rename your data, or make changes to these configs.

LLaVA dataset

File structure

./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2

Pretrain

LLaVA-Pretrain

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1

Finetune

  1. Text data

    1. LLaVA-Instruct-150K

      # Make sure you have git-lfs installed (https://git-lfs.com)
      git lfs install
      git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
      
  2. Image data (a placement sketch follows this list)

    1. COCO (coco): train2017

    2. GQA (gqa): images

    3. OCR-VQA (ocr_vqa): download script

      1. โš ๏ธ Modify the name of OCR-VQA's images to keep the extension as .jpg!

        #!/bin/bash
        ocr_vqa_path="<your-directory-path>"
        
        find "$target_dir" -type f | while read file; do
            extension="${file##*.}"
            if [ "$extension" != "jpg" ]
            then
                cp -- "$file" "${file%.*}.jpg"
            fi
        done
        
    4. TextVQA (textvqa): train_val_images

    5. VisualGenome (VG): part1, part2
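
After downloading, place (or symlink) each image folder so that it matches the llava_images layout in the file structure above. A sketch, where the /path/to/... source paths are placeholders for wherever you extracted each download:

mkdir -p ./data/llava_data/llava_images
cd ./data/llava_data/llava_images
# the source paths below are placeholders; point them at your extracted downloads
ln -s /path/to/coco ./coco          # must contain train2017/
ln -s /path/to/gqa ./gqa            # must contain images/
ln -s /path/to/ocr_vqa ./ocr_vqa    # must contain images/
ln -s /path/to/textvqa ./textvqa    # must contain train_images/
ln -s /path/to/vg ./vg              # must contain VG_100K/ and VG_100K_2/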

RefCOCO dataset

File structure


./data
├── refcoco_annotations
│   ├── refcoco
│   │   ├── instances.json
│   │   ├── refs(google).p
│   │   └── refs(unc).p
│   ├── refcoco+
│   │   ├── instances.json
│   │   └── refs(unc).p
│   └── refcocog
│       ├── instances.json
│       ├── refs(google).p
│       └── refs(umd).p
├── coco_images
│   ├── *.jpg
...

Download the RefCOCO, RefCOCO+, and RefCOCOg annotation files using the links below. Both COCO train2017 and train2014 images are valid for coco_images.

Image source | Download path
RefCOCO      | annotations
RefCOCO+     | annotations
RefCOCOg     | annotations

After downloading the annotations, unzip the files and place them in the ./data/refcoco_annotations directory. Then, convert the annotations to JSON format using the command below, which saves the converted JSON files in the ./data/llava_data/RefCOCOJson/ directory.

xtuner preprocess refcoco --ann-path $RefCOCO_ANN_PATH --image-path $COCO_IMAGE_PATH \
--save-path $SAVE_PATH # ./data/llava_data/RefCOCOJson/
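
For example, with the file structure above:

xtuner preprocess refcoco --ann-path ./data/refcoco_annotations --image-path ./data/coco_images \
--save-path ./data/llava_data/RefCOCOJson/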