GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis

Abstract

The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of Remote Sensing (RS) images. Natural language presents an intuitive interface for accessing, querying, and interpreting the data in such archives. However, existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, with limited exposure to the specialized RS domain. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead focus solely on attributes such as date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a broad spectrum of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset is spatially and temporally balanced, spanning the globe and covering the last 25 years of observations. GAIA’s construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. We also release the automated processing framework developed for this purpose, enabling the broader research community to generate captions for RS images from web-crawled image-text data. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks, proving its value as a crucial resource for advancing the field.
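
To illustrate the second construction stage, the snippet below sketches how a scraped image and its accompanying text could be turned into multiple grounded captions with GPT-4o. This is a minimal sketch using the official OpenAI Python SDK; the prompt wording, the `generate_captions` helper, and the request parameters are illustrative assumptions, not the exact prompts or framework released with the paper.

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode an image for inline submission to the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def generate_captions(image_path: str, source_text: str, n_captions: int = 5) -> list[str]:
    """Ask GPT-4o for several scientifically grounded captions,
    conditioned on the text scraped alongside the image.
    Prompt wording here is an illustrative stand-in for the paper's prompts."""
    prompt = (
        "You are a remote sensing expert. Using the source text below, write "
        f"{n_captions} distinct, factual captions for this satellite image. "
        "Describe land cover, visible phenomena, and scale; do not speculate "
        "beyond the available evidence. Return one caption per line.\n\n"
        f"Source text: {source_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```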

Publication
arXiv

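The fine-tuning experiments mentioned in the abstract adapt CLIP (and BLIP2) to GAIA’s image-text pairs. The sketch below shows one contrastive training step for CLIP using Hugging Face `transformers`; the checkpoint choice, optimizer settings, and the `training_step` helper are illustrative assumptions rather than the paper’s exact training setup.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumption: the ViT-B/32 CLIP checkpoint; the paper may use a different variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.1)


def training_step(images, captions):
    """One contrastive update on a batch of paired RS images and captions.

    images: list of PIL.Image, captions: list of matching strings.
    """
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    # return_loss=True makes CLIPModel compute the symmetric InfoNCE loss
    # over the in-batch image-text similarity matrix.
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

Since GAIA provides five captions per image, one natural choice is to sample a single caption per image in each batch so that in-batch negatives remain well defined.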

Angelos Zavras
PhD Candidate
Dimitrios Michail
Associate Professor @ HUA
Adjunct Researcher @ OrionLab

My research interests include Graph Algorithms, Graph Mining, Machine Learning, and Algorithm Engineering.

Ioannis Papoutsis
Head of Orion Lab
Assistant Professor of Artificial Intelligence for Earth Observation @ NTUA
Adjunct Researcher @ NOA

Earth Observation, Machine Learning, Natural Hazard Monitoring