CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
Abstract
A vital and rapidly growing application, remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples -- aligned in space and time -- and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially biases our cross- and self-attention matrices. These strategies improve representations and allow our models to effectively extrapolate to images up to 17.6x larger at test-time. CROMA outperforms the current SoTA multispectral model, evaluated on: four classification benchmarks -- finetuning (avg. 1.8%), linear (avg. 2.4%) and nonlinear (avg. 1.4%) probing, kNN classification (avg. 3.5%), and K-means clustering (avg. 8.4%); and three segmentation benchmarks (avg. 6.4%). CROMA's rich, optionally multimodal representations can be widely leveraged across remote sensing applications.
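To make the 2D-ALiBi idea concrete, below is a minimal sketch of how a spatial attention bias of this kind can be constructed. It assumes the standard ALiBi slope schedule (a geometric sequence per head) and Euclidean distances between patch grid coordinates; the function names are illustrative and the exact formulation in CROMA may differ.

```python
import torch


def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Standard ALiBi head slopes: a geometric sequence 2^(-8i/num_heads),
    # i = 1..num_heads (assumes num_heads is a power of two).
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])


def build_2d_alibi_bias(grid_size: int, num_heads: int) -> torch.Tensor:
    # Coordinates of each patch on a (grid_size x grid_size) image grid.
    ys, xs = torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)

    # Pairwise Euclidean distances between patch positions, shape (N, N).
    dists = torch.cdist(coords, coords, p=2)

    # One penalty strength per head; nearby patches are penalized less,
    # so attention is softly biased toward spatial neighbors.
    slopes = alibi_slopes(num_heads).view(num_heads, 1, 1)
    return -slopes * dists  # (num_heads, N, N), added to attention logits


# Because the bias is computed from patch coordinates rather than learned
# per position, it can be rebuilt for a larger grid at test time, which is
# what enables extrapolation to larger images.
bias = build_2d_alibi_bias(grid_size=8, num_heads=12)
print(bias.shape)  # torch.Size([12, 64, 64])
```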