Mind the modality gap: Towards a remote sensing vision-language model via cross-modal alignment

Abstract

Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks and often rivals fully supervised baselines, despite not being explicitly trained for those tasks. Nevertheless, there are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. These domains not only exhibit fundamentally different distributions compared to natural images, but also commonly rely on complementary modalities, beyond RGB, to derive meaningful insights. To this end, we propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP. Our two-stage procedure addresses the distribution shift and enhances CLIP's zero-shot capabilities. First, we robustly fine-tune CLIP according to the PAINT patching protocol, in order to deal with the aforementioned distribution shift. Building upon this foundation, we facilitate the cross-modal alignment of an RS modality encoder by distilling knowledge from the CLIP visual and textual encoders. This process extends the zero-shot capabilities of CLIP and enriches CLIP's shared embedding space with domain-specific knowledge. We ultimately demonstrate our method on the tasks of RS imagery classification and cross-modal retrieval. We empirically show that both robust fine-tuning and cross-modal alignment translate to significant performance gains across several RS benchmark datasets. Notably, these enhancements are achieved without relying on textual descriptions, without introducing any task-specific parameters, without training from scratch, and without catastrophic forgetting. Our work highlights the potential of leveraging the large-scale pre-training of existing VLMs and extending their zero-shot capabilities to specialized fields, paving the way for the resource-efficient establishment of in-domain multi-modal foundation models in RS and beyond.
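
The abstract describes a two-stage recipe: PAINT-style patching of CLIP for the distribution shift, followed by cross-modal alignment of an RS-modality encoder against the frozen CLIP encoders. The sketch below is illustrative only and is reconstructed from the abstract, not taken from the paper's code: `rs_encoder`, the interpolation weight `alpha`, the temperature `tau`, and the symmetric contrastive formulation are assumptions. It assumes PyTorch and a CLIP-style model exposing `encode_image` / `encode_text` (as in openai/CLIP or open_clip).

```python
# Minimal sketch of the two-stage procedure outlined in the abstract.
# NOT the authors' implementation; names and hyperparameters are illustrative.

import copy
import torch
import torch.nn.functional as F


def paint_patch(zeroshot_model, finetuned_model, alpha=0.5):
    """Stage 1 (PAINT-style patching): linearly interpolate between the
    zero-shot and fine-tuned CLIP weights to handle the distribution shift
    while avoiding catastrophic forgetting."""
    patched = copy.deepcopy(zeroshot_model)
    zs_state = zeroshot_model.state_dict()
    ft_state = finetuned_model.state_dict()
    patched_state = {}
    for key in zs_state:
        if torch.is_floating_point(zs_state[key]):
            patched_state[key] = (1.0 - alpha) * zs_state[key] + alpha * ft_state[key]
        else:
            patched_state[key] = zs_state[key]  # leave non-float buffers untouched
    patched.load_state_dict(patched_state)
    return patched


def alignment_loss(rs_encoder, clip_model, rs_batch, rgb_batch, text_tokens, tau=0.07):
    """Stage 2 (cross-modal alignment): distil the frozen, patched CLIP
    encoders into an RS-modality encoder (e.g. for SAR imagery) with a
    symmetric contrastive loss, so RS embeddings land in CLIP's shared
    image-text embedding space."""
    with torch.no_grad():  # CLIP stays frozen; it only provides alignment targets
        img_emb = F.normalize(clip_model.encode_image(rgb_batch), dim=-1)
        txt_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)

    rs_emb = F.normalize(rs_encoder(rs_batch), dim=-1)

    # Contrast RS embeddings against both CLIP modalities over the batch.
    logits_img = rs_emb @ img_emb.t() / tau
    logits_txt = rs_emb @ txt_emb.t() / tau
    targets = torch.arange(rs_emb.size(0), device=rs_emb.device)
    return 0.5 * (F.cross_entropy(logits_img, targets) +
                  F.cross_entropy(logits_txt, targets))
```

Because the RS encoder is aligned directly to CLIP's embedding space, zero-shot classification and cross-modal retrieval on the new modality reduce to the usual CLIP similarity search, with no task-specific parameters added.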

Publication
arXiv


Angelos Zavras
PhD Candidate
Dimitrios Michail
Associate Professor @ HUA
Adjunct Researcher @ OrionLab


Ioannis Papoutsis
Head of Orion Lab
Assistant Professor of Artificial Intelligence for Earth Observation @ NTUA
Adjunct Researcher @ NOA
