license: apache-2.0
Model and Inputs
Prithvi is a first-of-its-kind temporal Vision Transformer pre-trained by the IBM and NASA team on Harmonised Landsat Sentinel 2 (HLS) data from the contiguous US. The model uses a self-supervised encoder built on a ViT architecture and trained with a Masked AutoEncoder (MAE) learning strategy and an MSE loss function. It applies spatial attention across multiple patches as well as temporal attention for each patch.
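The MAE strategy with an MSE loss can be illustrated with a minimal NumPy sketch. It assumes, as in the original MAE formulation, that the loss is averaged over the masked tokens only; the function names, token counts, and the stand-in prediction are hypothetical and are not taken from the Prithvi code.

```python
import numpy as np

def random_masking(num_tokens, mask_ratio, rng):
    """Return a boolean mask where True marks tokens hidden from the encoder."""
    num_keep = int(num_tokens * (1 - mask_ratio))
    order = rng.permutation(num_tokens)
    mask = np.ones(num_tokens, dtype=bool)
    mask[order[:num_keep]] = False  # the first num_keep shuffled tokens stay visible
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error per token, averaged over masked tokens only."""
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return per_token[mask].mean()

rng = np.random.default_rng(0)
target = rng.random((16, 8))          # 16 tokens with 8 values each (hypothetical sizes)
pred = target + 0.1                   # stand-in for a decoder reconstruction
mask = random_masking(16, 0.5, rng)   # hide half of the tokens
loss = masked_mse(pred, target, mask)
```

Restricting the loss to masked tokens means the model is only scored on content it had to infer from the visible patches.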
The model accepts remote sensing data in a video format (B, C, T, H, W). The temporal dimension (T) is central to this application and is absent from most other work on remote sensing modeling. The ability to handle a time series of remote sensing images can benefit a variety of downstream tasks (e.g. burn scar segmentation, flood segmentation, land cover classification). The model can also handle static imagery, which is fed in with T=1.
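As a rough illustration of the (B, C, T, H, W) layout, the sketch below stacks per-time-step arrays into one input. The spatial size of 224 and the random data are assumptions for illustration; only the axis order and the six channels come from the description above.

```python
import numpy as np

# Hypothetical example sizes; C=6 matches the six HLS bands used in pre-training.
B, C, T, H, W = 1, 6, 3, 224, 224

# One (C, H, W) array per time step, in chronological order.
frames = [np.random.rand(C, H, W).astype(np.float32) for _ in range(T)]

# Stack along a new time axis to get (C, T, H, W), then add the batch axis.
video = np.stack(frames, axis=1)[np.newaxis, ...]

# A static image uses the same layout with a time axis of length 1.
static = frames[0][np.newaxis, :, np.newaxis, :, :]
```

Here `video` has shape (1, 6, 3, 224, 224) and `static` has shape (1, 6, 1, 224, 224).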
Pre-training
The model was pre-trained with NASA's HLS V2 L30 product (30m granularity) from the contiguous United States. The bands that were used are the following:
- Blue
- Green
- Red
- Narrow NIR
- SWIR 1
- SWIR 2
Code
The model follows the original MAE repo with some modifications, including:
- replacing the 2D patch embedding with a 3D patch embedding;
- replacing the 2D positional embedding with a 3D positional embedding;
- replacing 2D patchify and unpatchify with 3D versions;
- adding infrared bands alongside RGB.
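To make the 3D patchify and unpatchify idea concrete, here is a minimal NumPy sketch that splits a (B, C, T, H, W) array into flattened spatio-temporal patches and restores it. The function names, the patch size of 16, and the temporal patch length of 1 are assumptions for illustration and may not match the repository's implementation or element ordering.

```python
import numpy as np

def patchify_3d(x, pt, ph, pw):
    """(B, C, T, H, W) -> (B, L, pt*ph*pw*C), with L = (T/pt)*(H/ph)*(W/pw)."""
    B, C, T, H, W = x.shape
    t, h, w = T // pt, H // ph, W // pw
    x = x.reshape(B, C, t, pt, h, ph, w, pw)
    x = x.transpose(0, 2, 4, 6, 3, 5, 7, 1)  # -> B, t, h, w, pt, ph, pw, C
    return x.reshape(B, t * h * w, pt * ph * pw * C)

def unpatchify_3d(p, shape, pt, ph, pw):
    """Inverse of patchify_3d; `shape` is the original (B, C, T, H, W)."""
    B, C, T, H, W = shape
    t, h, w = T // pt, H // ph, W // pw
    x = p.reshape(B, t, h, w, pt, ph, pw, C)
    x = x.transpose(0, 7, 1, 4, 2, 5, 3, 6)  # -> B, C, t, pt, h, ph, w, pw
    return x.reshape(B, C, T, H, W)

# Hypothetical input: 6 bands, 3 time steps, 32x32 pixels
x = np.arange(1 * 6 * 3 * 32 * 32, dtype=np.float32).reshape(1, 6, 3, 32, 32)
patches = patchify_3d(x, pt=1, ph=16, pw=16)
restored = unpatchify_3d(patches, x.shape, pt=1, ph=16, pw=16)
```

With these sizes each time step yields 2 x 2 spatial patches, so `patches` holds 12 tokens of 1536 values each, and `restored` equals `x`.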
Inference and demo
There is an inference script (Prithvi_run_inference.py) that runs image reconstruction on a set of HLS images assumed to be from the same location at different time steps (see the example below). These should be provided in chronological order in GeoTIFF format, including the channels described above (Blue, Green, Red, Narrow NIR, SWIR 1, SWIR 2) in reflectance units. There is also a demo that leverages the same code.
python Prithvi_run_inference.py --data_files t1.tif t2.tif t3.tif --yaml_file_path /path/to/yaml/Prithvi_100.yaml --checkpoint /path/to/checkpoint/Prithvi_100.pth --output_dir /path/to/out/dir/ --input_indices <space separated 0-based indices of channels to select from input> --mask_ratio 0.5 --img_size <length of one side of square input shape>
This demo is a starting point that can be generalized to different input shapes and types.
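The idea behind selecting channels with 0-based indices can be sketched as follows. This is an illustration only: `select_channels` is a hypothetical helper, the 8-band tile is made up, and the snippet does not reproduce how the inference script reads GeoTIFF files.

```python
import numpy as np

def select_channels(image, input_indices):
    """Pick bands from a (bands, H, W) array by 0-based index, in the order given."""
    return image[list(input_indices)]

# Hypothetical 8-band tile in which the six required bands sit at positions 1 to 6.
tile = np.arange(8 * 4 * 4, dtype=np.float32).reshape(8, 4, 4)
selected = select_channels(tile, [1, 2, 3, 4, 5, 6])
```

The order of the indices determines the order of the output channels, so they should follow the band order listed under pre-training (Blue, Green, Red, Narrow NIR, SWIR 1, SWIR 2).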
Finetuning examples
Examples of finetuning the model for image segmentation using the mmsegmentation library are available through Hugging Face (e.g. burn scars segmentation, flood mapping, and multi-temporal crop classification), with the code used for the experiments available on GitHub. The repository also contains instructions for finetuning the model for flood detection on the popular open-access sen1floods11 dataset.
Feedback
Your feedback is invaluable to us. If you have any feedback about the model, please feel free to share it with us. You can do this by submitting issues on our open-source repository, hls-foundation-os, on GitHub.
Citation
If this model helped your research, please cite Prithvi-V2 in your publications. Here is a BibTeX entry as an example:
@article{Prithvi-2-preprint,
author = {},
title = {{Title}},
journal = {Preprint Available on arxiv},
year = {2024}
}