[WIP] GAEA: A Geolocation Aware Conversational Assistant
Summary
Image geolocalization, in which a model predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. Traditionally, however, the user cannot use such a model to learn anything beyond the coordinates: the model lacks an understanding of the location and the conversational ability to communicate it. Recently, with the rapid progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images with LMMs. Yet these issues remain unaddressed: while LMMs perform well on general tasks, they struggle on specialized downstream tasks such as geolocalization. In this work, we address this problem by introducing GAEA, a conversational model that provides information about an image's location as the user requires. No large-scale dataset for training such a model exists, so we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose GAEA-Bench, a diverse benchmark of 3.5k image-text pairs covering diverse question types for evaluating conversational capabilities. We evaluate 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%. Our dataset, model, and code are publicly available.
GAEA
is the first open-source conversational assistant with global-scale geolocalization capabilities.
Main contributions:
GAEA-Train: A Diverse Training Dataset:
We propose GAEA-Train, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.
GAEA-Bench: Evaluating Conversational Geolocalization:
To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.
GAEA: An Interactive Geolocalization Chatbot:
We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.
Benchmarking Against State-of-the-Art LMMs:
We quantitatively compare our model’s performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.
This page is dedicated to the GAEA model
We compare the performance of various LMMs on the geographically grounded visual question answering task included in our new GAEA-Bench benchmark. Most LMMs can describe the Wat Pho statue, but only GAEA, our Geolocation Aware Assistant, retrieves the correct nearby cafe, Cafe Amazon (left). Qualitative SVQA comparison showing GAEA's ability to provide accurate, location-specific answers where other LMMs fail (right).
Model Description
Architecture
Overview of the GAEA model architecture and workflow. An input image is first processed by a Vision Transformer (ViT) encoder, whose output is passed through a visual projector to obtain visual embeddings. Simultaneously, the input text prompt is converted into text embeddings. The combined visual and textual embeddings are then fed into the Qwen2.5 LLM, which generates a response conditioned on the multimodal input. We follow a single-stage training approach, unfreezing the MLP projector and performing LoRA fine-tuning in the same stage.
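The fusion step above can be sketched schematically. The toy NumPy snippet below uses placeholder dimensions (the actual ViT and Qwen2.5 hidden sizes differ) and a single linear map standing in for the MLP projector; it illustrates only the data flow, not the model's implementation.

```python
import numpy as np

# Hypothetical toy dimensions -- placeholders, not the real model's sizes.
NUM_PATCHES, VIT_DIM = 16, 32   # ViT output: one feature vector per image patch
LLM_DIM = 64                    # LLM embedding width
SEQ_LEN = 8                     # tokenized prompt length

rng = np.random.default_rng(0)

# 1) ViT encoder output: patch embeddings for the input image.
patch_embeddings = rng.standard_normal((NUM_PATCHES, VIT_DIM))

# 2) Visual projector maps visual features into the LLM embedding space
#    (a single linear layer here, standing in for the MLP projector).
W_proj = rng.standard_normal((VIT_DIM, LLM_DIM))
visual_embeddings = patch_embeddings @ W_proj          # (NUM_PATCHES, LLM_DIM)

# 3) Text prompt embeddings, as produced by the LLM's embedding table.
text_embeddings = rng.standard_normal((SEQ_LEN, LLM_DIM))

# 4) Concatenate along the sequence axis and feed the result to the LLM.
llm_input = np.concatenate([visual_embeddings, text_embeddings], axis=0)
print(llm_input.shape)  # (24, 64): image patches followed by text tokens
```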
Evaluation Results
Comparison with SoTA LMMs on GAEA-Bench (Conversational)
We benchmark 11 open-source and proprietary LMMs on GAEA-Bench. Notably, GAEA outperforms all open-source models and surpasses the proprietary models on decision-making questions (multiple-choice and true/false). We provide the relative performance change for each model compared to GAEA. We use GPT-4o as the judge for evaluation; since LLM judges are documented to prefer their own long-form outputs, the scores of the proprietary models are likely overestimated.
We showcase the performance of various LMMs on four diverse question types. GAEA achieves the best average performance across all question types.
Qualitative Results (Conversational)
Qualitative MCQs comparison showing GAEA’s ability to provide accurate answers where other LMMs fail.
Comparison with Specialized Models on Standard Geolocalization Datasets
We benchmark the performance of various specialized models on standard geolocation datasets. GAEA demonstrates competitive results, outperforming GaGA on multiple distance thresholds in both IM2GPS and IM2GPS3k.
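Geolocalization accuracy on datasets such as IM2GPS and IM2GPS3k is commonly reported as the fraction of predictions falling within a set of great-circle distance thresholds (street 1 km, city 25 km, region 200 km, country 750 km, continent 2500 km in the standard protocol). The sketch below computes this metric with the haversine formula; the exact evaluation script used for GAEA may differ in details.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions within each distance threshold (street to continent)."""
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {km: sum(d <= km for d in dists) / len(dists) for km in thresholds_km}

# Example: one prediction close to the ground truth, one far away.
preds  = [(40.7128, -74.0060), (48.8566, 2.3522)]   # New York, Paris
truths = [(40.7580, -73.9855), (51.5074, -0.1278)]  # Times Square, London
print(accuracy_at_thresholds(preds, truths))
# {1: 0.0, 25: 0.5, 200: 0.5, 750: 1.0, 2500: 1.0}
```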
Comparison with best SoTA LMMs on City/Country Prediction
Classification accuracy for both city and country labels, where GAEA surpasses several recent LMMs in performance.
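Since LMMs produce free-form text, scoring city/country predictions typically requires normalizing the output before exact matching. The following is a minimal sketch of one common protocol (lowercasing and stripping punctuation); the paper's exact matching rules may differ.

```python
def normalize(name: str) -> str:
    """Lowercase and drop punctuation so 'New York City.' matches 'new york city'."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def label_accuracy(predictions, ground_truth):
    """Exact-match accuracy after light normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

# Hypothetical example outputs and labels.
cities_pred = ["New York City.", "paris", "Tokyo"]
cities_true = ["New York City", "Paris", "Kyoto"]
print(label_accuracy(cities_pred, cities_true))  # 2 of 3 match
```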
Citation
BibTeX:
@misc{campos2025gaeageolocationawareconversational,
title={GAEA: A Geolocation Aware Conversational Assistant},
author={Ron Campos and Ashmal Vayani and Parth Parag Kulkarni and Rohit Gupta and Aritra Dutta and Mubarak Shah},
year={2025},
eprint={2503.16423},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.16423},
}
Licensing Information
We release our work under the CC BY-NC 4.0 License, which allows others to share, remix, and adapt the work for non-commercial purposes, provided proper attribution is given to the original creators.
Model tree for ucf-crcv/GAEA-7B
Base model
Qwen/Qwen2.5-VL-7B-Instruct