```mermaid
flowchart LR
    A["Satellite Data<br>Spatial×Spectral×Temporal"] --> B["Encoder<br>Deep Learning"]
    B --> C["Embedding<br>Learned Vector<br>Representation"]
    C --> D["Decoder<br>Deep Learning"]
    D --> E["Task-Specific<br>Output"]
    C --> F["Alternative:<br>Traditional ML<br>e.g., Lasso Regression"]
    F --> G["Simple<br>Predictions"]
```
Overview
Geospatial Foundation Models (GFMs) represent a paradigm shift in how we process and analyze Earth observation data. Like their counterparts in natural language processing (e.g., GPT) and computer vision (e.g., CLIP), GFMs learn powerful representations from vast amounts of unlabeled satellite imagery that can be adapted for numerous downstream tasks.
This document explores the journey from raw satellite data to specific predictions, explaining how pre-training and fine-tuning enable diverse applications in Earth observation.
The Foundation Model Architecture
Input Data Structure
Geospatial data presents unique challenges compared to traditional computer vision:
- Spatial dimensions: Typically patches of 100×100 to 224×224 pixels
- Spectral dimensions: Multiple bands beyond RGB (e.g., NIR, SWIR, thermal)
- Temporal dimensions: Time series of observations (e.g., weekly, monthly)
For example, a typical input might be structured as:
3 bands × 100×100 pixels × 12 time steps
This creates a high-dimensional data cube that captures how Earth’s surface changes across space, spectrum, and time.
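To make the dimensions concrete, the sketch below builds one such data cube as a NumPy array. The band, time, and pixel counts mirror the example above; the variable names are purely illustrative.

```python
import numpy as np

# Hypothetical data cube: 3 spectral bands, 12 monthly time steps, 100x100 pixels
n_bands, n_steps, height, width = 3, 12, 100, 100

# One training sample laid out as (bands, time, height, width)
data_cube = np.random.rand(n_bands, n_steps, height, width).astype(np.float32)

print(data_cube.shape)  # (3, 12, 100, 100)
print(data_cube.size)   # 360,000 values for a single 100x100 patch
```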
The Encoder-Decoder Framework
The foundation model architecture consists of:
- Encoder: Transforms high-dimensional satellite data into compact, information-rich embeddings
- Embedding: A learned vector representation (think of it as a “deep learning version of PCA”)
- Decoder: Transforms embeddings back into meaningful outputs
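The following is a deliberately simplified PyTorch sketch of this encoder-embedding-decoder flow. The module names, layer choices, and sizes are illustrative assumptions (and the data cube is shrunk to keep the toy example light), not the architecture of any particular GFM.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Compresses a flattened data cube into a compact embedding (illustrative only)."""
    def __init__(self, in_features: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_features, embed_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class ToyDecoder(nn.Module):
    """Maps an embedding to a task-specific output, e.g., class scores."""
    def __init__(self, embed_dim: int = 256, out_features: int = 10):
        super().__init__()
        self.head = nn.Linear(embed_dim, out_features)

    def forward(self, z):
        return self.head(z)

# A batch of 8 small data cubes: (batch, bands, time, height, width)
x = torch.randn(8, 3, 4, 32, 32)
encoder = ToyEncoder(in_features=3 * 4 * 32 * 32)
decoder = ToyDecoder(out_features=10)

embedding = encoder(x)       # (8, 256) compact learned representation
output = decoder(embedding)  # (8, 10) task-specific output
```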
Pre-training: Learning Without Labels
The power of foundation models comes from self-supervised pre-training on massive unlabeled datasets. Unlike traditional supervised learning, these approaches create training signals from the data itself.
Common Pre-training Objectives
1. Masked Autoencoding (MAE)
- Task: Randomly mask patches of the input and predict the missing content
- Intuition: Forces the model to understand spatial context and relationships
- Example: Hide 75% of image patches and reconstruct them
```python
# Conceptual example (pseudocode)
masked_input = mask_random_patches(satellite_image, mask_ratio=0.75)
embedding = encoder(masked_input)
reconstruction = decoder(embedding)
loss = MSE(reconstruction, original_patches)
```
2. Temporal Prediction
- Task: Predict the next time step or fill in missing temporal observations
- Intuition: Learns seasonal patterns and temporal dynamics
- Example: Given January-June data, predict July
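A minimal sketch of this temporal-prediction objective, assuming the encoder has already produced one embedding per month. The GRU predictor and 256-dimensional embeddings are illustrative choices, not a specific GFM's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical per-time-step embeddings for one location: (time_steps, embed_dim)
embeddings = torch.randn(7, 256)  # January-July

# A simple GRU summarises January-June; a linear head predicts July's embedding
predictor = nn.GRU(input_size=256, hidden_size=256, batch_first=True)
head = nn.Linear(256, 256)

_, hidden = predictor(embeddings[:6].unsqueeze(0))  # encode January-June
predicted_july = head(hidden[-1])                   # (1, 256)

loss = F.mse_loss(predicted_july, embeddings[6:7])  # compare to the observed July embedding
```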
3. Multi-modal Alignment
- Task: Align embeddings from different sensors or modalities
- Intuition: Learns invariant features across different data sources
- Example: Match Sentinel-2 optical with Sentinel-1 SAR data
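One simple way to express such an alignment objective is to pull paired embeddings of the same location together across sensors. The sketch below assumes the two sensor-specific encoders already exist and uses a plain cosine-similarity loss; CLIP-style models use a contrastive variant of the same idea (see the next objective).

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings of the same locations from two sensors: (batch, embed_dim)
optical_emb = torch.randn(32, 256)  # e.g., from a Sentinel-2 encoder
sar_emb = torch.randn(32, 256)      # e.g., from a Sentinel-1 encoder

# Pull paired embeddings together regardless of which sensor produced them
alignment_loss = 1.0 - F.cosine_similarity(optical_emb, sar_emb, dim=1).mean()
```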
4. Contrastive Learning
- Task: Learn similar embeddings for nearby locations/times
- Intuition: Captures spatial and temporal continuity
- Example: Patches from the same field should have similar embeddings
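A common way to implement contrastive learning (and contrastive cross-modal alignment) is the InfoNCE loss, sketched below. Batch size, embedding dimension, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Contrastive loss: each anchor should match its own positive,
    not the positives of the other samples in the batch."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    logits = anchor @ positive.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(anchor.size(0))      # diagonal entries are the matching pairs
    return F.cross_entropy(logits, targets)

# Hypothetical embeddings of two nearby patches from the same field
emb_a = torch.randn(32, 256)
emb_b = torch.randn(32, 256)
loss = info_nce(emb_a, emb_b)
```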
Downstream Tasks: From General to Specific
Once pre-trained, GFMs can be adapted for various Earth observation tasks through fine-tuning or as feature extractors.
Task Categories
1. Pixel-Level Predictions (Semantic Segmentation)
Land Cover Classification
- Input: Multi-spectral satellite imagery
- Output: Per-pixel class labels (forest, urban, water, etc.)
- Fine-tuning: Add segmentation head, train on labeled maps

Change Detection
- Input: Multi-temporal image pairs
- Output: Binary change masks or change type maps
- Fine-tuning: Modify decoder for temporal comparisons

Cloud/Shadow Masking
- Input: Multi-spectral imagery
- Output: Binary masks for clouds and shadows
- Fine-tuning: Lightweight decoder trained on quality masks
2. Image-Level Predictions
Scene Classification
- Input: Image patches
- Output: Single label per patch (agricultural, residential, etc.)
- Fine-tuning: Replace decoder with classification head

Regression Tasks
- Input: Image patches
- Output: Continuous values (biomass, yield, poverty indicators)
- Fine-tuning: Linear probe or shallow MLP on embeddings
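This is where the "Traditional ML" branch of the diagram at the top comes in: once embeddings are computed, a simple model such as Lasso regression can serve as the probe. The sketch below uses scikit-learn on hypothetical pre-computed embeddings; the data, target, and dimensions are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Hypothetical pre-computed embeddings (one 256-d vector per image patch)
# and a continuous target such as biomass per patch
embeddings = np.random.rand(500, 256)
biomass = np.random.rand(500)

X_train, X_test, y_train, y_test = train_test_split(embeddings, biomass, test_size=0.2)

model = Lasso(alpha=0.01)  # sparse linear probe on frozen embeddings
model.fit(X_train, y_train)
print("R^2 on held-out patches:", model.score(X_test, y_test))
```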
3. Time Series Analysis
Crop Type Mapping
- Input: Temporal sequence of observations
- Output: Crop type per pixel/parcel
- Fine-tuning: Temporal attention mechanisms

Phenology Detection
- Input: Time series data
- Output: Key dates (green-up, peak, senescence)
- Fine-tuning: Specialized temporal decoders
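As a simplified stand-in for the temporal attention mechanisms mentioned above, the sketch below aggregates per-date embeddings by averaging over time before classifying crop type; all shapes and class counts are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical embeddings for a batch of parcels: (batch, time_steps, embed_dim)
parcel_series = torch.randn(16, 12, 256)

# Simplest temporal aggregation: average over time, then classify the crop
pooled = parcel_series.mean(dim=1)   # (16, 256)
crop_classifier = nn.Linear(256, 8)  # e.g., 8 crop classes
logits = crop_classifier(pooled)     # (16, 8)
```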
4. Multi-modal Fusion
Data Gap Filling
- Input: Partial observations from multiple sensors
- Output: Complete, harmonized time series
- Fine-tuning: Cross-attention between modalities

Super-resolution
- Input: Low-resolution imagery
- Output: High-resolution reconstruction
- Fine-tuning: Specialized upsampling decoders
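An upsampling decoder can be as small as a convolution followed by sub-pixel shuffling. The sketch below is a minimal illustration using PyTorch's PixelShuffle; the channel counts and the 2x upscale factor are arbitrary choices.

```python
import torch
import torch.nn as nn

# Hypothetical super-resolution head: upsample a coarse feature map by 2x
sr_head = nn.Sequential(
    nn.Conv2d(64, 3 * 4, kernel_size=3, padding=1),  # 3 output bands x (2x2) upscale
    nn.PixelShuffle(upscale_factor=2),
)

low_res_features = torch.randn(1, 64, 50, 50)  # (batch, channels, H, W)
high_res = sr_head(low_res_features)           # (1, 3, 100, 100)
```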
Fine-tuning Strategies
1. Full Fine-tuning
- Update all model parameters
- Best for: Large labeled datasets, significant domain shift
- Drawback: Computationally expensive, risk of overfitting
2. Linear Probing
- Freeze encoder, train only classification head
- Best for: Limited labeled data, similar domains
- Benefit: Fast, prevents overfitting
3. Adapter Layers
- Insert small trainable modules between frozen layers
- Best for: Multiple tasks, parameter efficiency
- Benefit: Task-specific adaptation with minimal parameters
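A minimal sketch of an adapter: a small bottleneck module with a residual connection placed after a frozen layer, so only the adapter's parameters are updated. The dimensions and the stand-in "frozen block" are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted after a frozen layer;
    only these parameters are trained for the new task."""
    def __init__(self, dim: int = 256, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

# Frozen pre-trained block followed by a trainable adapter (illustrative)
frozen_block = nn.Linear(256, 256)
for p in frozen_block.parameters():
    p.requires_grad = False

adapter = Adapter(dim=256)
x = torch.randn(8, 256)
out = adapter(frozen_block(x))
```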
4. Prompt Tuning
- Learn task-specific input modifications
- Best for: Very limited data, zero-shot scenarios
- Benefit: Extremely parameter efficient
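Prompt tuning can be sketched as learning a handful of extra input tokens that are prepended to the patch-token sequence fed into a frozen transformer encoder; everything else stays fixed. The token counts and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical frozen ViT-style encoder operating on a sequence of patch tokens
embed_dim, n_patches, n_prompts = 256, 196, 8

patch_tokens = torch.randn(4, n_patches, embed_dim)  # (batch, patches, dim)

# The only trainable parameters: a few learnable "prompt" tokens
prompt_tokens = nn.Parameter(torch.zeros(1, n_prompts, embed_dim))

# Prepend the prompts to each input sequence before the frozen encoder
batch_prompts = prompt_tokens.expand(patch_tokens.size(0), -1, -1)
encoder_input = torch.cat([batch_prompts, patch_tokens], dim=1)  # (4, 204, 256)
```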
Example: From Pre-training to Land Cover Mapping
Let’s trace the journey for a land cover classification task:
Pre-training Phase
```python
# Masked autoencoding on unlabeled Sentinel-2 data
for batch in massive_unlabeled_dataset:
    masked_input = random_mask(batch)
    embedding = encoder(masked_input)
    reconstruction = decoder(embedding)
    optimize(reconstruction_loss)  # e.g., MSE between reconstruction and the original patches
```
Fine-tuning Phase
```python
# Freeze encoder, add segmentation head
encoder.freeze()
segmentation_head = SegmentationDecoder(num_classes=10)

# Train on labeled land cover data
for image, label_map in labeled_dataset:
    embedding = encoder(image)
    prediction = segmentation_head(embedding)
    optimize(cross_entropy_loss(prediction, label_map))
```
Inference Phase
```python
# Apply to new imagery
new_image = load_sentinel2_scene()
embedding = encoder(new_image)
land_cover_map = segmentation_head(embedding)
```
Why This Approach Works
1. Data Efficiency
Pre-training on abundant unlabeled data reduces the need for expensive labeled datasets.
2. Transfer Learning
Features learned from global data transfer to local applications.
3. Multi-task Capability
One pre-trained model can be adapted for numerous downstream tasks.
4. Robustness
Exposure to diverse data during pre-training improves generalization.
5. Temporal Understanding
Unlike traditional CNN approaches, GFMs can natively handle time series.
Practical Considerations
Choosing Pre-training Objectives
- For agricultural applications: Prioritize temporal objectives
- For urban mapping: Focus on spatial detail and multi-scale features
- For climate monitoring: Emphasize long-term temporal patterns
Data Requirements
- Pre-training: Terabytes of unlabeled imagery
- Fine-tuning: Can work with hundreds to thousands of labeled samples
- Inference: Real-time processing possible with optimized models
Computational Resources
- Pre-training: Requires significant GPU resources (days to weeks)
- Fine-tuning: Feasible on single GPUs (hours to days)
- Inference: Can be optimized for edge deployment
Future Directions
- Foundation Models for Specific Domains
  - Agriculture-specific models
  - Urban-focused architectures
  - Ocean and coastal specialists
- Multi-modal Foundation Models
  - Combining optical, SAR, and hyperspectral data
  - Integration with weather and climate data
  - Fusion with ground-based sensors
- Efficient Architectures
  - Lightweight models for edge computing
  - Quantization and pruning techniques
  - Neural architecture search for Earth observation
- Interpretability
  - Understanding what features the model learns
  - Explainable predictions for decision support
  - Uncertainty quantification
Summary
Geospatial Foundation Models represent a powerful approach to Earth observation, transforming how we extract information from satellite data. Through self-supervised pre-training on massive unlabeled datasets, these models learn rich representations that can be efficiently adapted for diverse downstream tasks. Whether predicting land cover, detecting changes, or monitoring crop growth, GFMs provide a flexible and powerful framework for understanding our changing planet.
The key insight is that the expensive process of learning good representations can be done once on unlabeled data, then reused many times for different applications with minimal additional training. This democratizes access to advanced Earth observation capabilities and accelerates the development of new applications.
As we continue to accumulate Earth observation data at unprecedented rates, foundation models will become increasingly important for transforming this data deluge into actionable insights for science, policy, and society.
Available Foundation Models
Several geospatial foundation models are now available for research and application:
Open Source Models
- Prithvi - NASA/IBM’s 100M parameter model trained on HLS data
- Clay - Open foundation model for environmental monitoring
- SatMAE - Masked autoencoder for temporal-spatial satellite data
- GeoSAM - Segment Anything adapted for Earth observation
- SpectralGPT - Foundation model for spectral remote sensing
Libraries and Frameworks
- TorchGeo - PyTorch library with pre-trained models
- TerraTorch - Flexible framework for Earth observation deep learning
- MMEARTH - Multi-modal Earth observation models
Resources and Benchmarks
- Awesome Remote Sensing Foundation Models - Comprehensive collection
- GEO-Bench - Benchmark for evaluating GFMs
- PhilEO Bench - ESA’s Earth observation benchmark
Visualization Resources
To generate architectural diagrams for this explainer, you can run the provided visualization script:
```bash
cd book/extras/scripts
python visualize_gfm_architecture.py
```
This will create three diagrams in the book/extras/images/ directory:
- gfm_architecture.png: Overview of the encoder-decoder architecture
- gfm_pretraining_tasks.png: Examples of self-supervised pre-training objectives
- gfm_task_hierarchy.png: Taxonomy of downstream tasks enabled by GFMs