---
license: mit
---

## RS-LLaVA: Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

- **Repository:** https://github.com/BigData-KSU/RS-LLaVA
- **Paper:** https://www.mdpi.com/2072-4292/16/9/1477
- **Demo:** Coming soon.

## How to Get Started with the Model

### Install

1. Clone this repository and navigate to the RS-LLaVA folder
```
git clone https://github.com/BigData-KSU/RS-LLaVA.git
cd RS-LLaVA
```

2. Create the environment and install the package
```
conda create -n rs-llava python=3.10 -y
conda activate rs-llava
pip install --upgrade pip  # enable PEP 660 support
```

3. Install additional packages
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.35
pip install einops
pip install sentencepiece
pip install accelerate
pip install peft
```

---

### Inference

Use the code below to get started with the model.

```python
import torch
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from PIL import Image

# Model paths: RS-LLaVA LoRA weights and the base language model.
model_path = 'BigData-KSU/RS-llava-v1.5-7b-LoRA'
model_base = 'Intel/neural-chat-7b-v3-3'

# Conversation template.
conv_mode = 'llava_v1'

disable_torch_init()
model_name = get_model_name_from_path(model_path)
print('model name', model_name)
print('model base', model_base)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, model_base, model_name)


def chat_with_RS_LLaVA(cur_prompt, image_name):
    # Preprocess the image and prepend the image tokens to the prompt.
    image_mem = Image.open(image_name)
    image_tensor = image_processor.preprocess(image_mem, return_tensors='pt')['pixel_values'][0]
    if model.config.mm_use_im_start_end:
        cur_prompt = f"{DEFAULT_IM_START_TOKEN} {DEFAULT_IMAGE_TOKEN} {DEFAULT_IM_END_TOKEN}\n{cur_prompt}"
    else:
        cur_prompt = f"{DEFAULT_IMAGE_TOKEN}\n{cur_prompt}"

    # Build the conversation prompt from a copy of the template.
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], cur_prompt)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    # Tokenize the prompt, replacing the image placeholder with IMAGE_TOKEN_INDEX.
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor.unsqueeze(0).half().cuda(),
            do_sample=True,
            temperature=0.2,
            top_p=None,
            num_beams=1,
            no_repeat_ngram_size=3,
            max_new_tokens=2048,
            use_cache=True)

    input_token_len = input_ids.shape[1]
    n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
    if n_diff_input_output > 0:
        print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')

    # Decode only the newly generated tokens.
    outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
    outputs = outputs.strip()

    return outputs


if __name__ == "__main__":
    print('Model input...............')
    cur_prompt = 'Generate three questions and answers about the content of this image. Then, compile a summary.'
    image_name = 'assets/example_images/parking_lot_010.jpg'

    outputs = chat_with_RS_LLaVA(cur_prompt, image_name)
    print('Model Response.....')
    print(outputs)
```
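The same helper can be reused for captioning- or VQA-style prompts. The calls below are only an illustration (the prompts are examples, and the image path simply reuses the sample image from above):

```python
# Illustrative follow-up calls reusing chat_with_RS_LLaVA from the script above;
# swap in your own remote sensing image and prompts.
caption = chat_with_RS_LLaVA('Describe the image in detail.',
                             'assets/example_images/parking_lot_010.jpg')
print(caption)

answer = chat_with_RS_LLaVA('How many cars are parked in the image?',
                            'assets/example_images/parking_lot_010.jpg')
print(answer)
```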
## Training Details

Training RS-LLaVA is carried out in three stages:

#### Stage 1: Pretraining (Feature Alignment)

This stage uses the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset together with two RS datasets, [NWPU](https://github.com/HaiyanHuang98/NWPU-Captions) and [RSICD](https://huggingface.co/datasets/arampacha/rsicd).

| Dataset | Size | Link |
| --- | --- | --- |
| CC-3M Concept-balanced 595K | 211 MB | [Link](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) |
| NWPU-RSICD-Pretrain | 16.6 MB | [Link](https://huggingface.co/datasets/BigData-KSU/RS-instructions-dataset/blob/main/NWPU-RSICD-pretrain.json) |

#### Stage 2: Visual Instruction Tuning

To teach the model to follow instructions, we use the proposed RS-Instructions Dataset together with the LLaVA-Instruct-150K dataset (a sketch for downloading these JSON files is given after the citation below).

| Dataset | Size | Link |
| --- | --- | --- |
| RS-Instructions | 91.3 MB | [Link](https://huggingface.co/datasets/BigData-KSU/RS-instructions-dataset/blob/main/NWPU-RSICD-UAV-UCM-LR-DOTA-intrcutions.json) |
| llava_v1_5_mix665k | 1.03 GB | [Link](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json) |

#### Stage 3: Downstream Task Tuning

In this stage, the model is fine-tuned on one of the downstream tasks (e.g., RS image captioning or VQA).

## Citation

**BibTeX:**

```bibtex
@Article{rs16091477,
  AUTHOR = {Bazi, Yakoub and Bashmal, Laila and Al Rahhal, Mohamad Mahmoud and Ricci, Riccardo and Melgani, Farid},
  TITLE = {RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery},
  JOURNAL = {Remote Sensing},
  VOLUME = {16},
  YEAR = {2024},
  NUMBER = {9},
  ARTICLE-NUMBER = {1477},
  URL = {https://www.mdpi.com/2072-4292/16/9/1477},
  ISSN = {2072-4292},
  DOI = {10.3390/rs16091477}
}
```
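As a supplementary note on the Training Details above, the JSON files listed in the tables can be pulled directly from the Hugging Face Hub. A minimal sketch, assuming `huggingface_hub` is installed and the file names in the tables remain unchanged:

```python
import json

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download the RS pretraining and instruction-tuning JSON files from the Hub.
# File names are taken from the tables above; adjust if the repository layout changes.
pretrain_path = hf_hub_download(
    repo_id="BigData-KSU/RS-instructions-dataset",
    filename="NWPU-RSICD-pretrain.json",
    repo_type="dataset",
)
instruct_path = hf_hub_download(
    repo_id="BigData-KSU/RS-instructions-dataset",
    filename="NWPU-RSICD-UAV-UCM-LR-DOTA-intrcutions.json",
    repo_type="dataset",
)

with open(instruct_path) as f:
    records = json.load(f)

# Inspect the first record; the files are expected to follow a LLaVA-style
# conversation format (image reference plus human/assistant turns), but verify
# the schema before wiring them into a training script.
print(len(records))
print(records[0])
```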