---
license: mit
base_model:
- mistralai/Pixtral-12B-2409
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- lora
datasets:
- Multimodal-Fatima/FGVC_Aircraft_train
- takara-ai/FloodNet_2021-Track_2_Dataset_HF
---

*Takara.ai Logo*

From the Frontier Research Team at **Takara.ai**, we present a specialized LoRA adapter for aerial imagery analysis and visual question answering.

---

# pixtral_aerial_VQA_adapter

## Overview

This repository contains a fine-tuned LoRA adapter for the Pixtral-12B model, optimized specifically for aerial imagery analysis and visual question answering. The adapter enables detailed processing of aerial footage, with a focus on construction site surveying, structural assessment, and environmental monitoring.

## Model Details

- **Type**: LoRA adapter
- **Total Parameters**: 6,225,920
- **Memory Usage**: 23.75 MB
- **Precision**: torch.float32
- **Layer Types**:
  - lora_A: 40
  - lora_B: 40
- **Base Model**: [mistralai/Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409)

## Capabilities

The adapter enhances Pixtral's ability to:

- Identify and describe construction elements in aerial imagery
- Detect structural issues in buildings and infrastructure
- Analyze progress in construction projects
- Monitor environmental changes and flooding events
- Process high-resolution aerial imagery with improved detail recognition

## Intended Use

- **Primary intended use**: processing aerial footage of construction sites for structural and construction surveying.
- Also applicable to other detailed VQA use cases involving aerial footage.
- Suitable for disaster response and assessment applications, particularly flood monitoring.

## Training Data

- **Datasets**:
  1. [FloodNet Track 2 dataset](https://huggingface.co/datasets/takara-ai/FloodNet_2021-Track_2_Dataset_HF)
  2. Subset of the [FGVC Aircraft dataset](https://huggingface.co/datasets/Multimodal-Fatima/FGVC_Aircraft_train)
  3. Custom dataset of 10 image-caption pairs created using Pixtral

## Training Procedure

- **Training method**: LoRA (Low-Rank Adaptation)
- **Base model**: Ertugrul/Pixtral-12B-Captioner-Relaxed
- **Training hardware**: Nebius-hosted NVIDIA H100 machine

## Usage Example

Because this repository ships only the LoRA adapter weights, the sketch below loads the base Pixtral checkpoint and attaches the adapter with PEFT. It assumes the adapter is stored in standard PEFT format and that the base checkpoint is available in a transformers-compatible layout. For deployment without a runtime PEFT dependency, see the merging sketch at the end of this card.

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, LlavaForConditionalGeneration

base_model_id = "mistralai/Pixtral-12B-2409"
adapter_id = "takara-ai/pixtral_aerial_VQA_adapter"

# Load the processor and base model, then attach the LoRA adapter
processor = AutoProcessor.from_pretrained(base_model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

# Load the aerial image and build a chat-style prompt with an image placeholder
image = Image.open("path_to_aerial_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the construction progress visible in this aerial image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Generate a response
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Citation

```bibtex
@misc{rahnemoonfar2020floodnet,
  title={FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding},
  author={Maryam Rahnemoonfar and Tashnim Chowdhury and Argho Sarkar and Debvrat Varshney and Masoud Yari and Robin Murphy},
  year={2020},
  eprint={2012.02951},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2012.02951}
}
```

---

For research inquiries and press, please reach out to research@takara.ai

> 人類を変革する (Transforming humanity)
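
## Appendix: Merging the Adapter into the Base Model

If you prefer to serve a single checkpoint rather than loading the adapter at runtime, the LoRA weights can be merged into the base model. This is a minimal sketch under the same assumptions as the usage example above (PEFT-format adapter, transformers-compatible base checkpoint); the output directory name `merged-pixtral-aerial-vqa` is hypothetical.

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, LlavaForConditionalGeneration

base_model_id = "mistralai/Pixtral-12B-2409"
adapter_id = "takara-ai/pixtral_aerial_VQA_adapter"

# Load the base model and attach the LoRA adapter
processor = AutoProcessor.from_pretrained(base_model_id)
base = LlavaForConditionalGeneration.from_pretrained(base_model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)

# Fold the low-rank updates into the base weights and drop the PEFT wrappers
merged = model.merge_and_unload()

# Save a standalone checkpoint that no longer requires peft at load time
merged.save_pretrained("merged-pixtral-aerial-vqa")  # hypothetical output path
processor.save_pretrained("merged-pixtral-aerial-vqa")
```

The merged checkpoint can then be loaded like any other transformers model, at the cost of duplicating the 12B base weights on disk.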