```mermaid
flowchart LR
    A["Satellite Data<br>Spatial×Spectral×Temporal"] --> B["Encoder<br>Deep Learning"]
    B --> C["Embedding<br>Learned Vector<br>Representation"]
    C --> D["Decoder<br>Deep Learning"]
    D --> E["Task-Specific<br>Output"]
    C --> F["Alternative:<br>Traditional ML<br>e.g., Lasso Regression"]
    F --> G["Simple<br>Predictions"]
```
Overview
Geospatial Foundation Models (GFMs) represent a paradigm shift in how we process and analyze Earth observation data. Like their counterparts in natural language processing (e.g., GPT) and computer vision (e.g., CLIP), GFMs learn powerful representations from vast amounts of unlabeled satellite imagery that can be adapted for numerous downstream tasks.
This document explores the journey from raw satellite data to specific predictions, explaining how pre-training and fine-tuning enable diverse applications in Earth observation.
The Foundation Model Architecture
Input Data Structure
Geospatial data presents unique challenges compared to traditional computer vision:
- Spatial dimensions: Typically patches of 100×100 to 224×224 pixels
- Spectral dimensions: Multiple bands beyond RGB (e.g., NIR, SWIR, thermal)
- Temporal dimensions: Time series of observations (e.g., weekly, monthly)
For example, a typical input might be structured as:
3 bands × 100×100 pixels × 12 time steps
This creates a high-dimensional data cube that captures how Earth’s surface changes across space, spectrum, and time.
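To make the dimensions concrete, the sketch below builds one such data cube as a NumPy array. The band, time, and pixel counts mirror the example above; the variable names are purely illustrative.

```python
import numpy as np

# Hypothetical data cube: 3 spectral bands, 12 monthly time steps, 100x100 pixels
n_bands, n_steps, height, width = 3, 12, 100, 100

# One training sample laid out as (bands, time, height, width)
data_cube = np.random.rand(n_bands, n_steps, height, width).astype(np.float32)

print(data_cube.shape)  # (3, 12, 100, 100)
print(data_cube.size)   # 360,000 values for a single 100x100 patch
```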
The Encoder-Decoder Framework
The foundation model architecture consists of:
- Encoder: Transforms high-dimensional satellite data into compact, information-rich embeddings
- Embedding: A learned vector representation (think of it as a “deep learning version of PCA”)
- Decoder: Transforms embeddings back into meaningful outputs
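The following is a deliberately simplified PyTorch sketch of this encoder-embedding-decoder flow. The module names, layer choices, and sizes are illustrative assumptions (and the data cube is shrunk to keep the toy example light), not the architecture of any particular GFM.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Compresses a flattened data cube into a compact embedding (illustrative only)."""
    def __init__(self, in_features: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_features, embed_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class ToyDecoder(nn.Module):
    """Maps an embedding to a task-specific output, e.g., class scores."""
    def __init__(self, embed_dim: int = 256, out_features: int = 10):
        super().__init__()
        self.head = nn.Linear(embed_dim, out_features)

    def forward(self, z):
        return self.head(z)

# A batch of 8 small data cubes: (batch, bands, time, height, width)
x = torch.randn(8, 3, 4, 32, 32)
encoder = ToyEncoder(in_features=3 * 4 * 32 * 32)
decoder = ToyDecoder(out_features=10)

embedding = encoder(x)       # (8, 256) compact learned representation
output = decoder(embedding)  # (8, 10) task-specific output
```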
Pre-training: Learning Without Labels
The power of foundation models comes from self-supervised pre-training on massive unlabeled datasets. Unlike traditional supervised learning, these approaches create training signals from the data itself.
Common Pre-training Objectives
1. Masked Autoencoding (MAE)
- Task: Randomly mask patches of the input and predict the missing content
- Intuition: Forces the model to understand spatial context and relationships
- Example: Hide 75% of image patches and reconstruct them
```python
# Conceptual example (pseudocode)
masked_input = mask_random_patches(satellite_image, mask_ratio=0.75)
embedding = encoder(masked_input)
reconstruction = decoder(embedding)
loss = MSE(reconstruction, original_patches)
```
2. Temporal Prediction
- Task: Predict the next time step or fill in missing temporal observations
- Intuition: Learns seasonal patterns and temporal dynamics
- Example: Given January-June data, predict July
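A minimal sketch of this temporal-prediction objective, assuming the encoder has already produced one embedding per month. The GRU predictor and 256-dimensional embeddings are illustrative choices, not a specific GFM's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical per-time-step embeddings for one location: (time_steps, embed_dim)
embeddings = torch.randn(7, 256)  # January-July

# A simple GRU summarises January-June; a linear head predicts July's embedding
predictor = nn.GRU(input_size=256, hidden_size=256, batch_first=True)
head = nn.Linear(256, 256)

_, hidden = predictor(embeddings[:6].unsqueeze(0))  # encode January-June
predicted_july = head(hidden[-1])                   # (1, 256)

loss = F.mse_loss(predicted_july, embeddings[6:7])  # compare to the observed July embedding
```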
3. Multi-modal Alignment
- Task: Align embeddings from different sensors or modalities
- Intuition: Learns invariant features across different data sources
- Example: Match Sentinel-2 optical with Sentinel-1 SAR data
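One simple way to express such an alignment objective is to pull paired embeddings of the same location together across sensors. The sketch below assumes the two sensor-specific encoders already exist and uses a plain cosine-similarity loss; CLIP-style models use a contrastive variant of the same idea (see the next objective).

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings of the same locations from two sensors: (batch, embed_dim)
optical_emb = torch.randn(32, 256)  # e.g., from a Sentinel-2 encoder
sar_emb = torch.randn(32, 256)      # e.g., from a Sentinel-1 encoder

# Pull paired embeddings together regardless of which sensor produced them
alignment_loss = 1.0 - F.cosine_similarity(optical_emb, sar_emb, dim=1).mean()
```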
4. Contrastive Learning
- Task: Learn similar embeddings for nearby locations/times
- Intuition: Captures spatial and temporal continuity
- Example: Patches from the same field should have similar embeddings
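A common way to implement contrastive learning (and contrastive cross-modal alignment) is the InfoNCE loss, sketched below. Batch size, embedding dimension, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Contrastive loss: each anchor should match its own positive,
    not the positives of the other samples in the batch."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    logits = anchor @ positive.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(anchor.size(0))      # diagonal entries are the matching pairs
    return F.cross_entropy(logits, targets)

# Hypothetical embeddings of two nearby patches from the same field
emb_a = torch.randn(32, 256)
emb_b = torch.randn(32, 256)
loss = info_nce(emb_a, emb_b)
```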
Downstream Tasks: From General to Specific
Once pre-trained, GFMs can be adapted for various Earth observation tasks through fine-tuning or as feature extractors.
Task Categories
1. Pixel-Level Predictions (Semantic Segmentation)
Land Cover Classification
- Input: Multi-spectral satellite imagery
- Output: Per-pixel class labels (forest, urban, water, etc.)
- Fine-tuning: Add segmentation head, train on labeled maps

Change Detection
- Input: Multi-temporal image pairs
- Output: Binary change masks or change type maps
- Fine-tuning: Modify decoder for temporal comparisons

Cloud/Shadow Masking
- Input: Multi-spectral imagery
- Output: Binary masks for clouds and shadows
- Fine-tuning: Lightweight decoder trained on quality masks
2. Image-Level Predictions
Scene Classification
- Input: Image patches
- Output: Single label per patch (agricultural, residential, etc.)
- Fine-tuning: Replace decoder with classification head

Regression Tasks
- Input: Image patches
- Output: Continuous values (biomass, yield, poverty indicators)
- Fine-tuning: Linear probe or shallow MLP on embeddings
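This is where the "Traditional ML" branch of the diagram at the top comes in: once embeddings are computed, a simple model such as Lasso regression can serve as the probe. The sketch below uses scikit-learn on hypothetical pre-computed embeddings; the data, target, and dimensions are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Hypothetical pre-computed embeddings (one 256-d vector per image patch)
# and a continuous target such as biomass per patch
embeddings = np.random.rand(500, 256)
biomass = np.random.rand(500)

X_train, X_test, y_train, y_test = train_test_split(embeddings, biomass, test_size=0.2)

model = Lasso(alpha=0.01)  # sparse linear probe on frozen embeddings
model.fit(X_train, y_train)
print("R^2 on held-out patches:", model.score(X_test, y_test))
```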
3. Time Series Analysis
Crop Type Mapping
- Input: Temporal sequence of observations
- Output: Crop type per pixel/parcel
- Fine-tuning: Temporal attention mechanisms

Phenology Detection
- Input: Time series data
- Output: Key dates (green-up, peak, senescence)
- Fine-tuning: Specialized temporal decoders
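As a simplified stand-in for the temporal attention mechanisms mentioned above, the sketch below aggregates per-date embeddings by averaging over time before classifying crop type; all shapes and class counts are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical embeddings for a batch of parcels: (batch, time_steps, embed_dim)
parcel_series = torch.randn(16, 12, 256)

# Simplest temporal aggregation: average over time, then classify the crop
pooled = parcel_series.mean(dim=1)   # (16, 256)
crop_classifier = nn.Linear(256, 8)  # e.g., 8 crop classes
logits = crop_classifier(pooled)     # (16, 8)
```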
4. Multi-modal Fusion
Data Gap Filling
- Input: Partial observations from multiple sensors
- Output: Complete, harmonized time series
- Fine-tuning: Cross-attention between modalities

Super-resolution
- Input: Low-resolution imagery
- Output: High-resolution reconstruction
- Fine-tuning: Specialized upsampling decoders
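An upsampling decoder can be as small as a convolution followed by sub-pixel shuffling. The sketch below is a minimal illustration using PyTorch's PixelShuffle; the channel counts and the 2x upscale factor are arbitrary choices.

```python
import torch
import torch.nn as nn

# Hypothetical super-resolution head: upsample a coarse feature map by 2x
sr_head = nn.Sequential(
    nn.Conv2d(64, 3 * 4, kernel_size=3, padding=1),  # 3 output bands x (2x2) upscale
    nn.PixelShuffle(upscale_factor=2),
)

low_res_features = torch.randn(1, 64, 50, 50)  # (batch, channels, H, W)
high_res = sr_head(low_res_features)           # (1, 3, 100, 100)
```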
Fine-tuning Strategies
1. Full Fine-tuning
- Update all model parameters
- Best for: Large labeled datasets, significant domain shift
- Drawback: Computationally expensive, risk of overfitting
2. Linear Probing
- Freeze encoder, train only classification head
- Best for: Limited labeled data, similar domains
- Benefit: Fast, prevents overfitting
3. Adapter Layers
- Insert small trainable modules between frozen layers
- Best for: Multiple tasks, parameter efficiency
- Benefit: Task-specific adaptation with minimal parameters
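A minimal sketch of an adapter: a small bottleneck module with a residual connection placed after a frozen layer, so only the adapter's parameters are updated. The dimensions and the stand-in "frozen block" are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted after a frozen layer;
    only these parameters are trained for the new task."""
    def __init__(self, dim: int = 256, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

# Frozen pre-trained block followed by a trainable adapter (illustrative)
frozen_block = nn.Linear(256, 256)
for p in frozen_block.parameters():
    p.requires_grad = False

adapter = Adapter(dim=256)
x = torch.randn(8, 256)
out = adapter(frozen_block(x))
```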
4. Prompt Tuning
- Learn task-specific input modifications
- Best for: Very limited data, zero-shot scenarios
- Benefit: Extremely parameter efficient
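Prompt tuning can be sketched as learning a handful of extra input tokens that are prepended to the patch-token sequence fed into a frozen transformer encoder; everything else stays fixed. The token counts and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical frozen ViT-style encoder operating on a sequence of patch tokens
embed_dim, n_patches, n_prompts = 256, 196, 8

patch_tokens = torch.randn(4, n_patches, embed_dim)  # (batch, patches, dim)

# The only trainable parameters: a few learnable "prompt" tokens
prompt_tokens = nn.Parameter(torch.zeros(1, n_prompts, embed_dim))

# Prepend the prompts to each input sequence before the frozen encoder
batch_prompts = prompt_tokens.expand(patch_tokens.size(0), -1, -1)
encoder_input = torch.cat([batch_prompts, patch_tokens], dim=1)  # (4, 204, 256)
```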
Example: From Pre-training to Land Cover Mapping
Let’s trace the journey for a land cover classification task:
Pre-training Phase
```python
# Masked autoencoding on unlabeled Sentinel-2 data
for batch in massive_unlabeled_dataset:
    masked_input = random_mask(batch)
    embedding = encoder(masked_input)
    reconstruction = decoder(embedding)
    optimize(reconstruction_loss)  # e.g., MSE between reconstruction and the original patches
```
Fine-tuning Phase
```python
# Freeze encoder, add segmentation head
encoder.freeze()
segmentation_head = SegmentationDecoder(num_classes=10)

# Train on labeled land cover data
for image, label_map in labeled_dataset:
    embedding = encoder(image)
    prediction = segmentation_head(embedding)
    optimize(cross_entropy_loss(prediction, label_map))
```
Inference Phase
```python
# Apply to new imagery
new_image = load_sentinel2_scene()
embedding = encoder(new_image)
land_cover_map = segmentation_head(embedding)
```
Why This Approach Works
1. Data Efficiency
Pre-training on abundant unlabeled data reduces the need for expensive labeled datasets.
2. Transfer Learning
Features learned from global data transfer to local applications.
3. Multi-task Capability
One pre-trained model can be adapted for numerous downstream tasks.
4. Robustness
Exposure to diverse data during pre-training improves generalization.
5. Temporal Understanding
Unlike traditional CNN approaches, GFMs can natively handle time series.
Practical Considerations
Choosing Pre-training Objectives
- For agricultural applications: Prioritize temporal objectives
- For urban mapping: Focus on spatial detail and multi-scale features
- For climate monitoring: Emphasize long-term temporal patterns
Data Requirements
- Pre-training: Terabytes of unlabeled imagery
- Fine-tuning: Can work with hundreds to thousands of labeled samples
- Inference: Real-time processing possible with optimized models
Computational Resources
- Pre-training: Requires significant GPU resources (days to weeks)
- Fine-tuning: Feasible on single GPUs (hours to days)
- Inference: Can be optimized for edge deployment
Future Directions
- Foundation Models for Specific Domains
  - Agriculture-specific models
  - Urban-focused architectures
  - Ocean and coastal specialists
- Multi-modal Foundation Models
  - Combining optical, SAR, and hyperspectral data
  - Integration with weather and climate data
  - Fusion with ground-based sensors
- Efficient Architectures
  - Lightweight models for edge computing
  - Quantization and pruning techniques
  - Neural architecture search for Earth observation
- Interpretability
  - Understanding what features the model learns
  - Explainable predictions for decision support
  - Uncertainty quantification
Summary
Geospatial Foundation Models represent a powerful approach to Earth observation, transforming how we extract information from satellite data. Through self-supervised pre-training on massive unlabeled datasets, these models learn rich representations that can be efficiently adapted for diverse downstream tasks. Whether predicting land cover, detecting changes, or monitoring crop growth, GFMs provide a flexible and powerful framework for understanding our changing planet.
The key insight is that the expensive process of learning good representations can be done once on unlabeled data, then reused many times for different applications with minimal additional training. This democratizes access to advanced Earth observation capabilities and accelerates the development of new applications.
As we continue to accumulate Earth observation data at unprecedented rates, foundation models will become increasingly important for transforming this data deluge into actionable insights for science, policy, and society.
Available Foundation Models
Several geospatial foundation models are now available for research and application:
Open Source Models
- Prithvi - NASA/IBM’s 100M parameter model trained on HLS data
- Clay - Open foundation model for environmental monitoring
- SatMAE - Masked autoencoder for temporal-spatial satellite data
- GeoSAM - Segment Anything adapted for Earth observation
- SpectralGPT - Foundation model for spectral remote sensing
Libraries and Frameworks
- TorchGeo - PyTorch library with pre-trained models
- TerraTorch - Flexible framework for Earth observation deep learning
- MMEARTH - Multi-modal Earth observation models
Resources and Benchmarks
- Awesome Remote Sensing Foundation Models - Comprehensive collection
- GEO-Bench - Benchmark for evaluating GFMs
- PhilEO Bench - ESA’s Earth observation benchmark
Visualization Resources
To generate architectural diagrams for this explainer, you can run the provided visualization script:
```bash
cd book/extras/scripts
python visualize_gfm_architecture.py
```
This will create three diagrams in the book/extras/images/ directory:
- gfm_architecture.png: Overview of the encoder-decoder architecture
- gfm_pretraining_tasks.png: Examples of self-supervised pre-training objectives
- gfm_task_hierarchy.png: Taxonomy of downstream tasks enabled by GFMs