SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images

Abstract

Most existing methods for training-free open-vocabulary semantic segmentation are based on CLIP.

While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present.

Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework.

In this paper, we present a comprehensive exploration of applying SAM 3 to the remote sensing open-vocabulary tasks (i.e., 2D semantic segmentation, change detection, and 3D semantic segmentation) without any training.

First, we implement a mask fusion strategy that combines the outputs from SAM 3’s semantic segmentation head and the Transformer decoder (instance head).

This allows us to leverage the strengths of both heads for better land coverage.

Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes.

Furthermore, we extend our method to open-vocabulary change detection by a joint instance- and pixel-level verification strategy built directly upon our fused logits.

We evaluate our method on extensive remote sensing datasets and tasks, including 20 segmentation datasets, 3 change detection datasets, and a 3D segmentation dataset.

Experiments show that our method achieves promising performance, demonstrating the potential of SAM 3 for remote sensing open-vocabulary tasks. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.

Preliminaries

SAM 3 Architecture

Visual embeddings and prompt embeddings are processed by an attention-based fusion encoder, which outputs prompt-conditioned features:

$$\mathbf{F}_{\mathrm{cond}} = \text{FusionEncoder}(\mathbf{F}_{\mathrm{visual}}, \mathbf{F}_{\mathrm{text}})$$

Presence Head: A DETR-based architecture that predicts the presence of categories in the image. If a category is predicted as present, SAM 3 proceeds to instantiate object tokens for that concept and then runs the mask decoder to produce high‑resolution masks
Semantic Segmentation Head: An FCN-style decoder that produces pixel-wise segmentation maps for each category in the vocabulary. This head is designed to capture fine-grained details and is particularly effective for segmenting small and dense objects, which are common in remote sensing images.
Instance Head: A MaskFormer-style decoder that generates instance-level masks. It generates N instance predictions $\left\{\left(P_{\mathrm{inst}}^{(k)}, P_{\mathrm{conf}}^{(k)}\right)\right\}_{i=1}^N$ , where $P_{\mathrm{inst}}^{(k)}$ is the predicted probability map for the k-th instance and $P_{\mathrm{conf}}^{(k)}$ is the confidence score for that instance.

Limitations of Naïve SAM 3 Inference

Directly applying SAM 3 to remote sensing open-vocabulary segmentation faces three practical gaps:

Instance outputs are sparse and fragmented. The instance head predicts object-level masks, but downstream evaluation requires dense semantic maps over all pixels.
Single-head prediction is biased. The semantic head is good at preserving continuity for amorphous regions.
Large vocabularies induce false positives. In patch-based geospatial inference, many queried categories are absent in each patch, yet still receive noisy responses.

Fragmented masks

Naïve SAM 3 generates fragmented instance masks for amorphous regions

Pipeline

To address these issues, SegEarth-OV3 introduces three training-free components: Instance Aggregation, Dual-Head Mask Fusion, and Presence-Guided Filtering.

Instance Aggregation

For countable objects like vehicles, semantic header’s predictions are often incomplete and blurry, while instance head can produce sharper masks.

Aggregates sparse instance-level predictions into a category-level probability map $\mathbf{P}_{\mathrm{inst\_agg}}$ :

$$\mathbf{P}_{\mathrm{inst\_agg}} = \max_{k=1}^N\left(\mathbf{P}_{\mathrm{inst}}^{(k)}\left(h, w\right) \cdot s_{\mathrm{conf}}^{(k)}\right).$$

if self.use_transformer_decoder:
    if inference_state['masks_logits'].shape[0] > 0:
        inst_len = inference_state['masks_logits'].shape[0]
        for inst_id in range(inst_len):
            instance_logits = inference_state['masks_logits'][inst_id].squeeze()
            instance_score = inference_state['object_score'][inst_id]
            # instance_mask = inference_state['masks'][inst_id].squeeze()
            
            # Handle potential dimension mismatch if SAM3 output differs slightly
            if instance_logits.shape != (h, w):
                instance_logits = F.interpolate(
                    instance_logits.view(1, 1, *instance_logits.shape), 
                    size=(h, w), 
                    mode='bilinear', 
                    align_corners=False
                ).squeeze()

            seg_logits[query_idx] = torch.max(seg_logits[query_idx], instance_logits * instance_score)

Dual-Head Mask Fusion

Instance head excels at delineating countable objects, e.g., vehicles.
Semantic head excels at segmenting amorphous regions, e.g., roads.

$$\mathbf{P}_{\mathrm{fused}}(h, w) = \max\left(\mathbf{P}_{\mathrm{inst\_agg}}(h, w), \mathbf{P}_{\mathrm{sem}}(h, w)\right)$$

if self.use_sem_seg:
    semantic_logits = inference_state['semantic_mask_logits']
    if semantic_logits.shape != (h, w):
            semantic_logits = F.interpolate(
                semantic_logits, 
                size=(h, w), 
                mode='bilinear', 
                align_corners=False
            ).squeeze()
    
    seg_logits[query_idx] = torch.max(seg_logits[query_idx], semantic_logits)

Presence-Guided Filtering

Vocabulary can cover a wide range of concepts. The model is prone to hallucinating categories that do not exist in the scene due to textural ambiguity. Concepts can collide with each other, e.g., “low vegetation” vs. “sports field”.

Presence head outputs a presence score $\mathbf{P}_{\mathrm{pres}}$ , which is indicates the likelihood of each category being present in the image. This can be used to reduce false positives by filtering out categories with low presence scores:

$$\mathbf{P}_{\mathrm{final}}^{(c)} = \mathbf{P}_{\mathrm{fused}}^{(c)} \cdot \mathbf{P}_{\mathrm{pres}}^{(c)}.$$

Finally, assign each pixel to the category with the highest final probability:

$$\mathbf{M}(h, w) = \arg\max_{c} \mathbf{P}_{\mathrm{final}}^{(c)}(h, w).$$

if self.use_presence_score:
    seg_logits[query_idx] = seg_logits[query_idx] * inference_state["presence_score"]

Experiments

Main Results

The quality of vocabulary is crucial for open vocabulary learning. Taking iSAID as example, it has 15 categories, some of which are very similar and easily causing ambiguity, e.g., “small vehicle” vs. “large vehicle”, “tennis field” vs. “soccer ball field”.

Downstream Tasks

Performance on GF Datasets

CLIP’s text encoder is trained on far more data than SAM 3’s. SAM 3’s text encoder focus on segmentation tasks and accept only short, structured prompts, wheras CLIP’s visual encoder can take long, loose sentences and more abstract vocabulary. In this sense, CLIP remains more powerful for general semantic understanding.

The paper ascribes the suboptimal performance on GF datasets to the fact that GID has much lower spatial resolution than UAV datasets, which is not very convincing.

GID is a large-scale land-cover dataset with 150 categories, consisting of closely related classes with subtle differences, e.g., “meadow” vs. “farmland”, which confuses SAM 3’s text encoder and leads to many false positives (categorial hallucination). The suboptimal performance on GID suggests that SAM 3 struggles with very large vocabularies and fine-grained category distinctions, which are common in remote sensing.

Performance on General Scene Datasets

Ablation

Visualization

Extension (Still in Progress)

OVCD (Open Vocabulary Change Detection)

3D Point Cloud Segmentation

Discussion

SAM 3 is a more promising foundation than CLIP for training-free open-vocabulary segmentation, but its text encoder still is not as powerful as CLIP’s for general semantic understanding.
CLIP’s text encoder may help mitigate the categorial hallucination issue in SAM 3.
A feasible line of work is to exploit CLIP’s strong text encoder to generate better prompts for SAM 3, which can then leverage its superior segmentation capabilities to produce more accurate masks.