Abstract

In this paper, we explore a principled way to enhance the quality of widely available coarse masks, enabling them to serve as reliable training data for segmentation models and thereby reduce annotation cost.

In contrast to prior refinement techniques that are tailored to specific models or tasks in a closed-world manner, we propose SAMRefiner, a universal and efficient approach that adapts SAM to the mask refinement task.

However, the segmentation quality of SAM is highly dependent on the quality of input prompts.

The core technique of our model is the noise-tolerant prompting scheme.

Specifically, we introduce a multi-prompt excavation strategy to mine diverse input prompts for SAM (i.e., distance-guided points, context-aware elastic bounding boxes, and Gaussian-style masks) from initial coarse masks.

These prompts can collaborate with each other to mitigate the effect of defects in coarse masks.

In particular, since SAM struggles to handle multi-object cases in semantic segmentation, we introduce a split-then-merge (STM) pipeline.

Additionally, we extend our method to SAMRefiner++ by introducing an additional IoU adaption step to further boost the performance of the generic SAMRefiner on the target dataset.

This step is self-boosted and requires no additional annotation.

The proposed framework is versatile and can flexibly cooperate with existing segmentation methods.

We evaluate our framework on a wide range of benchmarks under different settings, demonstrating better accuracy and efficiency.

SAMRefiner holds significant potential to expedite the evolution of refinement tools.

Our code is available at SAMRefiner.

Overview

Framework

Prompt Excavation

SAM Recap

SAM supports three types of input prompts: point prompts, box prompts, and mask prompts.

In practice, point prompts and box prompts are more commonly used due to their flexibility and ease of specification.

Mask prompts are typically used as an auxiliary for refinement. When box prompts fail to work, mask prompts can provide additional context to guide the segmentation.

The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts which, however, often require good skills to specify. … Our key finding reveals that given such low-quality prompts, SAM’s mask decoder tends to activate image features that are biased towards the background, or confined to specific object parts. [1]

Multi-prompt excavation

Point Prompts

SAM supports two types of point prompts: positive points and negative points.

# Example: two points, one positive, one negative
import numpy as np

# predictor is an initialized SamPredictor with an image already set
input_points = np.array([[100, 150], [200, 250]])  # (x, y) coordinates
input_labels = np.array([1, 0])  # 1 = positive, 0 = negative

masks, scores, logits = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=True,
)

Object-centric prior: The center of an object tends to be positive and feature-discriminative, while uncertainty is mostly located along boundaries.

Point prompts are excavated based on the following distance-guided criteria:

  • Positive prompt: the foreground point with the maximum distance to the nearest background position.
  • Negative prompt: the point that is farthest from the foreground region while remaining within the bounding box of the foreground region.
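The criteria above can be sketched with a Euclidean distance transform. This is a hedged illustration; the function name and details below are my own, not the paper's code:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def excavate_point_prompts(coarse_mask):
    """Pick one positive and one negative point from a binary coarse mask.

    Positive: the foreground pixel farthest from any background pixel.
    Negative: inside the foreground bounding box, the background pixel
    farthest from the foreground region.
    """
    fg = coarse_mask > 0
    # Distance of each foreground pixel to the nearest background pixel.
    dist_to_bg = distance_transform_edt(fg)
    pos = np.unravel_index(np.argmax(dist_to_bg), fg.shape)

    # Bounding box of the foreground region.
    ys, xs = np.nonzero(fg)
    y1, y2, x1, x2 = ys.min(), ys.max(), xs.min(), xs.max()

    # Distance of each background pixel to the nearest foreground pixel,
    # restricted to the bounding box.
    dist_to_fg = distance_transform_edt(~fg)
    box = dist_to_fg[y1:y2 + 1, x1:x2 + 1]
    ny, nx = np.unravel_index(np.argmax(box), box.shape)
    neg = (ny + y1, nx + x1)

    return pos, neg
```

With an L-shaped mask, for instance, the negative point lands in the background corner enclosed by the bounding box, exactly where coarse masks tend to over-segment.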

Box Prompts

Box prompts contain more local context than point prompts do. However, coarse masks often contain false-negative pixels, which degrade the quality of the derived bounding box.

Context-aware elastic bounding box is proposed to alleviate this issue. Given $\mathcal{I} \in \mathbb{R}^{H \times W \times 3}$, SAM encodes it into the feature embedding $\mathcal{F}_{\mathrm{im}} \in \mathbb{R}^{h \times w \times c}$. The coarse mask $\mathcal{M}_{\mathrm{coarse}} \in \mathbb{R}^{H \times W}$ is then downsampled to $\hat{\mathcal{M}} \in \mathbb{R}^{h \times w}$. A query embedding is generated by averaging the image features within the coarse mask region:

$$\mathcal{F}_{\mathrm{query}} = \frac{1}{|\mathbb{1}_{\hat{\mathcal{M}} > 0}|}\sum\mathbb{1}_{\hat{\mathcal{M}} > 0}\left(\mathcal{F}_{\mathrm{im}}\right)$$

The query embedding is in $\mathbb{R}^{c}$. $\mathcal{F}_{\mathrm{im}}$ is then upsampled to $\hat{\mathcal{F}}_{\mathrm{im}} \in \mathbb{R}^{H \times W \times c}$ and the similarity map $\mathrm{Sim} \in \mathbb{R}^{H \times W}$ is computed as:

Note that the upsampled embedding is not the raw encoder output: $\mathcal{F}_{\mathrm{im}}$ is first obtained by applying a linear transform to the image features extracted by SAM's image encoder.

$$\mathrm{Sim} = \left[\mathcal{F}_{\mathrm{query}} \cdot \hat{\mathcal{F}}_{\mathrm{im}}\right]_{\ge 0.5}$$
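Putting the two equations together, here is a minimal NumPy sketch. It assumes cosine similarity and that `feat_up` is the already upsampled, linearly transformed feature map; the function name and threshold handling are illustrative, not the paper's code:

```python
import numpy as np

def similarity_map(feat_up, coarse_mask, thresh=0.5):
    """Compute the binary similarity map used by the elastic box.

    feat_up:     (H, W, c) image features already upsampled to image size
    coarse_mask: (H, W) binary coarse mask
    Returns a binary map of pixels whose features are similar to the
    average foreground feature.
    """
    fg = coarse_mask > 0
    # Query embedding: mean feature over the coarse-mask region.
    query = feat_up[fg].mean(axis=0)                       # (c,)
    # Cosine similarity between the query and every pixel feature.
    norm = np.linalg.norm(feat_up, axis=-1) * np.linalg.norm(query) + 1e-8
    sim = feat_up @ query / norm                           # (H, W)
    return (sim >= thresh).astype(np.uint8)
```

The thresholding at 0.5 matches the $[\,\cdot\,]_{\ge 0.5}$ operator in the formula above.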

Box Prompts (Cont’d)

A bounding box $\mathcal{B}$ is elastic in four directions. In each direction, the side length can be enlarged by 10%, depending on the following condition:

If $\frac{\text{positive area of }\mathrm{Sim}_{\mathrm{context}}}{\text{area of the candidate strip}} > \lambda$, where $\lambda$ is the threshold (i.e., margin in the code), then the box side is enlarged in that direction.

# Expansion step per direction: 10% of the box side, capped at 10 px
steph = min(box_h * 0.1, 10)
stepw = min(box_w * 0.1, 10)
# ... (the check below is repeated for each of the four directions)
temp_x1 = int(x1 - stepw)  # candidate new left edge

if temp_x1 > 0 and temp_x1 < x1:
    # Strip between the old and the candidate left edges
    context_area = (y2 - y1) * (x1 - temp_x1)
    sim_context = sim[y1:y2, temp_x1:x1]
    pos_area = sim_context.sum()  # foreground-like pixels in the strip
    if pos_area / context_area > margin:
        final_x1 = temp_x1  # enough object context found: expand left
        changed = True

This operation essentially expands the receptive field of the box. context_area is the area of the candidate expansion strip, and pos_area counts the pixels in that strip whose features are similar to the object. If the ratio is high, the strip still contains object context, so the box is enlarged in that direction to accommodate the full object.

Mask Prompts

Mask prompts can provide dense guidance to SAM, especially when box prompts are insufficient. Gaussian-style mask prompts are generated by centering a Gaussian at the positive point prompt:

$$\mathrm{GM}(x, y) = \omega \cdot \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{|\mathbb{1}_{\mathcal{M}_{\mathrm{coarse}} > 0}| \cdot \gamma}\right),$$

where $(x_0, y_0)$ is the mask center point, selected using the same criterion as the positive point prompt. $\omega$ and $\gamma$ are hyperparameters controlling the amplitude and span of the Gaussian mask.
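The formula translates directly into code. In this sketch the `omega` and `gamma` defaults are placeholders, not the paper's settings:

```python
import numpy as np

def gaussian_mask_prompt(coarse_mask, center, omega=4.0, gamma=0.5):
    """Generate a Gaussian-style mask prompt centered at the positive point.

    The Gaussian's span scales with the coarse-mask area, so larger
    objects receive a wider prompt, as in the formula above.
    """
    H, W = coarse_mask.shape
    area = max(int((coarse_mask > 0).sum()), 1)  # |1_{M_coarse > 0}|
    y0, x0 = center
    ys, xs = np.mgrid[0:H, 0:W]
    sq_dist = (xs - x0) ** 2 + (ys - y0) ** 2
    return omega * np.exp(-sq_dist / (area * gamma))
```

The prompt peaks at $\omega$ on the center point and decays smoothly toward the boundaries, where the coarse mask is least trustworthy.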

Split-Then-Merge (STM) Pipeline

Coarse masks sometimes contain fragmented patches, which may or may not belong to the desired object. To address this issue, the STM pipeline, essentially a morphological operation, is proposed.

STM algorithm
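Assuming the split step is performed with connected components, the pipeline might be sketched as follows; `refine_fn`, `min_area`, and the exact merge rule are my assumptions, and the figure gives the authoritative algorithm:

```python
import numpy as np
from scipy.ndimage import label

def split_then_merge(coarse_mask, refine_fn, min_area=20):
    """Sketch of an STM-style pipeline.

    1. Split: decompose the coarse mask into connected components and
       drop tiny fragments that are likely noise.
    2. Refine each surviving component independently (refine_fn stands
       in for a per-component SAM refinement call).
    3. Merge: union the refined component masks.
    """
    labeled, n = label(coarse_mask > 0)
    merged = np.zeros_like(coarse_mask, dtype=np.uint8)
    for i in range(1, n + 1):
        comp = (labeled == i).astype(np.uint8)
        if comp.sum() < min_area:
            continue  # discard fragmented patches
        merged |= refine_fn(comp).astype(np.uint8)
    return merged
```

Handling each component separately sidesteps SAM's difficulty with multi-object prompts noted in the abstract: every refinement call sees a single-object mask.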

IoU Adaption

To further boost the performance of SAMRefiner on the target dataset, an additional IoU adaption step is proposed to form SAMRefiner++.

Given a prompt, SAM actually generates several candidate masks with predicted IoU scores (represented by IoU tokens), giving rise to two mask selection strategies:

  • Single-mask mode: selects the mask with the highest IoU score.
  • Multi-mask mode: selects several masks according to a certain criterion, e.g., top-$k$ or thresholding.

Select $n$ candidate masks $\{\mathcal{M}_i\}_{i=1}^n$ from SAM and take their predicted IoU scores $\{x_i\}_{i=1}^n$. Then, compute the IoU scores $\{y_i\}_{i=1}^n$ of the coarse mask with these candidate masks and select the best mask $\mathcal{M}_j$ (the one with the highest $y_j$). The IoU prediction is tuned with a margin ranking loss:

$$\mathcal{L} = \sum_{i=1, i \ne j}^{n}\max\left(0, x_i - x_j + m\right)$$
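The loss can be computed as in the NumPy sketch below, where $m$ is the margin; in practice this value would drive gradient updates to SAM's IoU token, which is beyond this illustration:

```python
import numpy as np

def iou_ranking_loss(pred_ious, coarse_ious, margin=0.1):
    """Margin ranking loss for IoU adaption.

    pred_ious:   SAM's predicted IoU scores x_i for n candidate masks
    coarse_ious: IoU y_i of each candidate with the coarse mask,
                 used only to pick the best candidate j
    The loss pushes x_j above every other x_i by at least `margin`.
    """
    x = np.asarray(pred_ious, dtype=np.float64)
    j = int(np.argmax(coarse_ious))     # best mask by coarse-mask IoU
    losses = np.maximum(0.0, x - x[j] + margin)
    losses[j] = 0.0                     # exclude the i == j term
    return losses.sum()
```

Because the target ranking comes from the coarse masks themselves, this step is self-boosted and needs no extra annotation, as stated in the abstract.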
IoU Adaption

Single-Prompt or Multi-Prompt?

A naïve idea is to provide more prompts to SAM, hoping that they complement each other to yield better segmentation. However, experimental results show that this is not always the case. Fewer prompts give rise to more ambiguous candidate masks, which in turn lets the coarse masks provide effective guidance.

Analysis on prompt selection

Visualization

Proposed techniques

Experiments

Comparisons

Comparisons

Ablation Studies

Ablation studies on the proposed strategies

Combination with Different Baselines

Combination with different baselines (COCO 2017)
Combination with different baselines (PASCAL VOC 2012)

Comparison with SoTAs

Comparison with SoTAs

Inspirations

  • FeatUp [2] may be a promising substitute for bilinear / nearest-neighbor upsampling in SAMRefiner.
  • Combine with other CLIP-based methods to enhance the open vocabulary capability.

References

[1]
Q. Fan et al., "Stable Segment Anything Model," in 13th International Conference on Learning Representations, ICLR 2025, Singapore, 2025, pp. 65340–65353.
[2]
S. Fu, M. Hamilton, L. Brandt, A. Feldmann, Z. Zhang, and W. T. Freeman, "FeatUp: A Model-Agnostic Framework for Features at Any Resolution," in 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 2024.