Xi’an Jiaotong University¹, Xidian University²
Note
In ZSL, new classes are identified solely based on pre-defined word embeddings constructed from \(C_\text{B}\).
Open Vocabulary Learning is proposed to handle the above issue by providing additional vision-language data (e.g., image captions) as auxiliary supervision.
CLIP: a vision-language foundation model with remarkable zero-shot capability, playing an important role in Open Vocabulary Semantic Segmentation (OVSS).
In practice, CLIP is often employed as an encoder. To exploit the zero-shot generalization capability of CLIP, researchers mainly focus on designing intricate decoders to accommodate pixel-level perception.
CLIP-based OVSS
💀 Spatial Resolution vs. Semantic Quality: Deep features often sacrifice spatial resolution for semantic quality. [1]
💀 Global Bias: CLIP was trained with image-level alignment, and the globally aligned features are not well suited for dense prediction tasks like semantic segmentation, which place more emphasis on local features. [2]
Note
The prediction head (decoder) is able to upsample LR feature maps into HR predictions. CLIP focuses only on the global \([\mathrm{CLS}]\) token, and even though patch-level tokens can be generated, they are inevitably contaminated by global bias, which is detrimental to dense prediction.
💀 Remote-sensing-specific issue: Unlike natural images, RS images are sensitive to low-resolution features, so previous solutions designed for natural images are sub-optimal.
Limitations of previous state-of-the-art OVSS methods
Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To this end, we try to introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, to restore lost spatial information in deep features in a training-free style. Further, based on the observation of the abnormal response of local patch tokens to the [CLS] token in CLIP, we propose a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves average improvements of 5.8%, 8.2%, 4%, and 15.3% over state-of-the-art methods on the 4 tasks.
Tip
Supplementary explanation about the functionality of SimFeatUp:
FeatUp is a model-agnostic upsampler that aims to restore the spatial information lost in low-resolution deep features. [1]
FeatUp Training Architecture
Training paradigm:
\[\mathcal{L}_{\mathrm{reconstruct}} = \|\boldsymbol{X} - \sigma_{\downarrow}\left(\sigma_{\uparrow}\left(\boldsymbol{X}\right)\right)\|_2^2.\]
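A minimal PyTorch sketch of this objective (tensor shapes and the `upsampler` / `downsampler` callables are placeholders; in FeatUp the upsampler is additionally guided by the raw image):

```python
import torch.nn.functional as F

def reconstruction_loss(X, upsampler, downsampler):
    """L_reconstruct = || X - sigma_down(sigma_up(X)) ||_2^2.

    X           : low-resolution feature map, e.g. (B, C, h, w)
    upsampler   : sigma_up,   learnable (e.g. a JBU module)
    downsampler : sigma_down, learnable (e.g. a learned blur + striding)
    """
    X_hr = upsampler(X)           # (B, C, H, W) with H > h, W > w
    X_rec = downsampler(X_hr)     # back to (B, C, h, w)
    return F.mse_loss(X_rec, X)   # squared L2 reconstruction error
```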
A major objective of this study is to develop a plug-and-play module capable of upsampling the encoder’s output without disrupting the original processing pipeline.
If upsampling is performed at the end of the encoding stage:
To address this, the authors propose performing upsampling at an earlier stage of the encoder. Specifically, given an encoder with \(N\) layers, the output of the \((N-1)\)-th layer, denoted as \(X_{N-1}\), is subjected to a projection operation in advance, followed by upsampling. The resulting feature map is then used as input to the final layer of the encoder:
\[X'_{N-1} = \sigma_{\uparrow}\left(\mathrm{Proj}\left(X_{N-1}\right)\right).\]
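A hedged sketch of this re-ordering (the names `encoder_layers`, `proj`, and `upsample` are illustrative placeholders, not the authors' exact interface):

```python
def encode_with_early_upsampling(x, encoder_layers, proj, upsample):
    """Run the first N-1 encoder layers, project and upsample the intermediate
    features, then feed the enlarged feature map through the final layer."""
    for layer in encoder_layers[:-1]:      # layers 1 .. N-1
        x = layer(x)
    x = upsample(proj(x))                  # X'_{N-1} = sigma_up(Proj(X_{N-1}))
    return encoder_layers[-1](x)           # the N-th layer now sees HR features
```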
Joint Bilateral Upsampling (JBU) was introduced in 2007, predating the widespread adoption of deep learning techniques.
In scenarios where the raw high-resolution image \(\boldsymbol{\tilde{I}}\) exceeds the computational capacity for direct processing, a downsampled copy is processed instead, yielding a low-resolution solution \(\boldsymbol{S}\). Afterwards, the low-resolution solution \(\boldsymbol{S}\) must be upsampled, guided by \(\boldsymbol{\tilde{I}}\), to reconstruct the high-resolution solution \(\boldsymbol{\tilde{S}}\).
For a given position \(\boldsymbol{p}\) in \(\boldsymbol{\tilde{S}}\), the high-resolution output is computed as:
\[\tilde{\boldsymbol{S}}_{\boldsymbol{p}} = \frac{1}{k_\boldsymbol{p}}\sum_{\boldsymbol{q}_\downarrow \in \Omega}\boldsymbol{S}_{\boldsymbol{q}_\downarrow}\;f\left(\|\boldsymbol{p}_\downarrow - \boldsymbol{q}_\downarrow\|\right)\;g\left(\|\tilde{\boldsymbol{I}}_{\boldsymbol{p}} - \tilde{\boldsymbol{I}}_{\boldsymbol{q}}\|\right),\]
where \(\boldsymbol{p}_\downarrow\) and \(\boldsymbol{q}_\downarrow\) are the low-resolution coordinates corresponding to \(\boldsymbol{p}\) and \(\boldsymbol{q}\), \(\Omega\) is the local neighborhood in \(\boldsymbol{S}\), \(f\) is the spatial filter kernel, \(g\) is the range filter kernel, and \(k_\boldsymbol{p}\) is a normalizing factor.
This formulation ensures that the upsampling process preserves both spatial and intensity-based consistency, leveraging the joint bilateral filtering mechanism.
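To make the weighting concrete, here is a naive (loop-based, unoptimized) sketch of the JBU formula above, with Gaussian choices for \(f\) and \(g\) and a scalar guidance image; all names and default values are illustrative:

```python
import numpy as np

def joint_bilateral_upsample(S, I_hr, scale, radius=3,
                             sigma_spatial=1.0, sigma_range=0.1):
    """S    : low-resolution solution,        shape (h, w)
       I_hr : high-resolution guidance image, shape (H, W) = (h*scale, w*scale)
       Returns the high-resolution solution S_tilde of shape (H, W)."""
    H, W = I_hr.shape
    S_tilde = np.zeros((H, W))
    for py in range(H):
        for px in range(W):
            acc, k = 0.0, 0.0
            # neighbourhood Omega, expressed in low-resolution coordinates
            for qy in range(max(py // scale - radius, 0),
                            min(py // scale + radius + 1, S.shape[0])):
                for qx in range(max(px // scale - radius, 0),
                                min(px // scale + radius + 1, S.shape[1])):
                    d_sp = (py / scale - qy) ** 2 + (px / scale - qx) ** 2
                    d_rg = (I_hr[py, px] - I_hr[qy * scale, qx * scale]) ** 2
                    w = np.exp(-d_sp / (2 * sigma_spatial ** 2)) \
                        * np.exp(-d_rg / (2 * sigma_range ** 2))
                    acc += w * S[qy, qx]   # spatially close AND similar-looking
                    k += w                 # normalizing factor k_p
            S_tilde[py, px] = acc / k
    return S_tilde
```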
FeatUp’s implementation:
Spatial filter kernel: \(k_\mathrm{spatial}(\boldsymbol{p}, \boldsymbol{q}) = \exp\left(\frac{-\|\boldsymbol{p} - \boldsymbol{q}\|_2^2}{2\;\tau_\mathrm{spatial}^2}\right).\)
Range filter kernel: \(k_\mathrm{range}(\boldsymbol{p}, \boldsymbol{q}) = \mathrm{softmax}_{\left(a, b\right) \in \Omega}\left(\frac{1}{\tau_{\mathrm{range}}^2}\;\mathrm{MLP}\left(G\left[i, j\right]\right)\cdot\mathrm{MLP}\left(G\left[a, b\right]\right)\right).\)
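A rough PyTorch sketch of how these two kernels could be combined for a single query position; the `mlp`, the temperatures, and the guidance features `G` are placeholders rather than FeatUp's exact implementation:

```python
import torch

def jbu_weights(coords, center, G_patch, g_center, mlp,
                tau_spatial=1.0, tau_range=1.0):
    """coords  : (K, 2) low-res positions q in the window Omega around `center`
       center  : (2,)   query position p, in the same coordinate frame
       G_patch : (K, D) guidance features G[a, b] over the window
       g_center: (D,)   guidance feature G[i, j] at the query position
       mlp     : shared learnable projection applied to the guidance features"""
    # spatial kernel: fixed Gaussian on the coordinate distance
    d2 = ((coords - center) ** 2).sum(dim=-1)               # (K,)
    k_spatial = torch.exp(-d2 / (2 * tau_spatial ** 2))     # (K,)

    # range kernel: similarity of projected guidance features, softmaxed over Omega
    sim = (mlp(G_patch) * mlp(g_center)).sum(dim=-1)         # (K,)
    k_range = torch.softmax(sim / tau_range ** 2, dim=0)     # (K,)

    w = k_spatial * k_range
    return w / w.sum()                                       # normalized JBU weights
```

The HR feature at \(\boldsymbol{p}\) is then the weighted sum of the low-resolution features in \(\Omega\) using these weights.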
[1] introduced two distinct architectures for the FeatUp upsampler: one leveraging the previously discussed Joint Bilateral Upsampling (JBU) technique, and the other employing an implicit deep network.
Two FeatUp architectures
The JBU-based upsampler utilizes a stack of parameterized JBU modules (each module has its independent parameters) to reconstruct high-resolution feature maps. Specifically, given an original image \(\boldsymbol{x}\) and its corresponding low-resolution feature map \(f(\boldsymbol{x})\), the high-resolution feature map \(\boldsymbol{F}_\mathrm{hr}\) is obtained through the following iterative process:
\[\boldsymbol{F}_\mathrm{hr} = \left(\mathrm{JBU}\left(\cdot, \boldsymbol{x}\right) \circ \mathrm{JBU}\left(\cdot, \boldsymbol{x}\right) \circ \cdots\right)\left(f\left(\boldsymbol{x}\right)\right).\]
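In code, the stack is simply a chain of independently parameterized JBU stages, each guided by the raw image (a sketch; `make_jbu` is a placeholder factory returning one learnable JBU module):

```python
import torch.nn as nn

class JBUStack(nn.Module):
    """FeatUp-style upsampler: a chain of independently parameterized JBU stages."""
    def __init__(self, make_jbu, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList([make_jbu() for _ in range(num_stages)])

    def forward(self, lr_feats, image):
        for jbu in self.stages:              # F_hr = (JBU ∘ JBU ∘ ...)(f(x))
            lr_feats = jbu(lr_feats, image)  # every stage is guided by the image x
        return lr_feats
```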
Important
JBU-based upsamplers impose strong spatial priors to accurately recover lost spatial information, whereas implicit upsamplers recover it by learning high-quality features.
Both \(\sigma_{\uparrow}\) and \(\sigma_{\downarrow}\) are parameterized and learnable, which introduces a relatively weak constraint. This flexibility allows \(\sigma_{\uparrow}\) and \(\sigma_{\downarrow}\) to be optimized in a manner that minimizes the overall loss. However, this optimization may lead to high-resolution (HR) images generated by \(\sigma_{\uparrow}\) that are inconsistent with their corresponding low-resolution (LR) counterparts.
To address this issue, an additional image-level loss is introduced:
\[\mathcal{L}_{\mathrm{image}} = \|\boldsymbol{I} - \mathrm{CRN}\left(\sigma_{\uparrow}\left(\mathcal{O}\left[1:hw + 1\right]\right)\right)\|_2^2,\]
where \(\mathrm{CRN}\) denotes a lightweight content retention network designed to preserve content consistency.
The total loss function for constructing SimFeatUp is then defined as:
\[\mathcal{L} = \mathcal{L}_{\mathrm{reconstruct}} + \mu\;\mathcal{L}_{\mathrm{image}}.\]
Specifically, the CRN consists of two 2D convolutional layers with activations and a final Tanh layer; the Tanh constrains the output to \([-1, 1]\), cf. VAEs [4].
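A hedged sketch of what such a content retention network and the combined objective could look like (channel widths, kernel sizes, and the `upsampler`/`downsampler` interfaces are illustrative, not the paper's exact configuration):

```python
import torch.nn as nn
import torch.nn.functional as F

class CRN(nn.Module):
    """Lightweight content retention network: maps upsampled features back to an
    RGB-like image so that an image-level reconstruction loss can be applied."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 3, kernel_size=3, padding=1),
            nn.Tanh(),                      # constrain the output to [-1, 1]
        )

    def forward(self, feats):
        return self.net(feats)

def simfeatup_loss(X, I, upsampler, downsampler, crn, mu=1.0):
    """L = L_reconstruct + mu * L_image; the raw image I (normalized to [-1, 1])
    serves both as JBU guidance and as the image-level reconstruction target."""
    X_hr = upsampler(X, I)                      # sigma_up, guided by the image
    l_rec = F.mse_loss(downsampler(X_hr), X)    # feature-level consistency
    l_img = F.mse_loss(crn(X_hr), I)            # image-level content retention
    return l_rec + mu * l_img
```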
The original JBU-based FeatUp has too many parameters because independently parameterized JBU modules are stacked many times (this is not a problem in training-required settings). This introduces too much uncertainty in a training-free setting, since the behavior of each JBU module is indeterminable. With this insight, the authors simplify JBU_Stack to JBU_One, i.e., only one parameterized JBU module is used for upsampling. If more upsampling is required, JBU_One can simply be executed several times, as sketched below.
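The corresponding change is small: one shared set of JBU parameters, applied repeatedly (same `make_jbu` placeholder as above):

```python
import torch.nn as nn

class JBUOne(nn.Module):
    """SimFeatUp-style upsampler: a single parameterized JBU, applied repeatedly."""
    def __init__(self, make_jbu):
        super().__init__()
        self.jbu = make_jbu()                  # one shared set of parameters

    def forward(self, lr_feats, image, num_steps=2):
        for _ in range(num_steps):             # e.g. two x2 steps for x4 overall
            lr_feats = self.jbu(lr_feats, image)
        return lr_feats
```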
Important
In the ablation study, it is found that JBU_One reduces parameters by nearly 4× while delivering a slight IoU gain.
In view of the wide range of object scales in remote sensing images, the upsampling kernel is enlarged from \(7 \times 7\) to \(11 \times 11\). A possible concern is that a larger kernel may bring in more irrelevant context, but \(k_\mathrm{spatial}\) has the property that the larger \(\|\boldsymbol{p} - \boldsymbol{q}\|\) is, the smaller the weight it contributes to the obtained HR features.
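For intuition, the Gaussian spatial kernel already suppresses distant (potentially irrelevant) positions inside the enlarged \(11 \times 11\) window; a quick numeric check (the value \(\tau_\mathrm{spatial} = 2\) is arbitrary and purely illustrative):

```python
import math

tau = 2.0                 # illustrative spatial temperature
for d in (1, 3, 5):       # distance ||p - q|| inside the 11x11 window
    print(d, math.exp(-d ** 2 / (2 * tau ** 2)))
# 1 -> ~0.88, 3 -> ~0.32, 5 -> ~0.04: far-away context contributes very little
```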
In the last layer of CLIP, given input: \(\boldsymbol{X} = \left[\boldsymbol{x}_\mathrm{cls}, \boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_{hw}\right]^\mathsf{T} \in \mathbb{R}^{(hw + 1) \times d}\);
\[\boldsymbol{y} = \boldsymbol{X} + \mathrm{softmax}\left(\frac{\boldsymbol{q}\;\boldsymbol{k}^\mathsf{T}}{\sqrt{d}}\right)\;\boldsymbol{v}.\]
\[\boldsymbol{z} = \boldsymbol{y} + \mathrm{FeedForwardNet}\left(\mathrm{LayerNormalization}\left(\boldsymbol{y}\right)\right).\]
Output:
\[\mathcal{O} = \mathrm{Proj}\left(\boldsymbol{z}\right) = \left[\boldsymbol{o}_\mathrm{cls}, \boldsymbol{o}_1, \cdots, \boldsymbol{o}_{hw}\right]^\mathsf{T} \in \mathbb{R}^{(hw + 1) \times c}.\]
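A compact sketch of this last residual block (single head, with the LayerNorm placement simplified to match the formulas above; `W_q`, `W_k`, `W_v`, `ln`, `ffn`, and `proj` are placeholder parameters/callables):

```python
import torch

def clip_last_layer(X, W_q, W_k, W_v, ln, ffn, proj):
    """X: (hw + 1, d) tokens [x_cls, x_1, ..., x_hw]; returns O of shape (hw + 1, c)."""
    d = X.shape[-1]
    q, k, v = X @ W_q, X @ W_k, X @ W_v
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    y = X + attn @ v                      # residual attention
    z = y + ffn(ln(y))                    # residual feed-forward
    return proj(z)                        # O = Proj(z) = [o_cls, o_1, ..., o_hw]
```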
The learnable global token \(\boldsymbol{x}_\mathrm{cls}\) captures aggregated information over the whole sequence, mixing global and local information. This is negligible for image-level classification, but detrimental to dense prediction tasks like segmentation.
Note
Before ClearCLIP [5], people believed that self-attention was the culprit behind the “noise”.
Prior studies have sought to mitigate the “noise” inherent in CLIP through three primary strategies, all of which involve modifications to the final layer of the model:
Tip
Modulated attention in SegEarth-OV (self-self attention):
\[\mathrm{M\text{-}SA} = \sum_{\boldsymbol{i} \in \left\{\boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v}\right\}}\mathrm{softmax}\left(\frac{\boldsymbol{i}\;\boldsymbol{i}^\mathsf{T}}{\sqrt{d}}\right)\;\boldsymbol{v}.\]
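A sketch of this modulated self-self attention (single head, argument shapes assumed as in the last-layer formulas above):

```python
import torch

def modulated_self_attention(q, k, v):
    """M-SA: sum of q-q, k-k and v-v self-self attention maps, each applied to v.
    q, k, v: (hw + 1, d) projections of the last-layer tokens."""
    d = q.shape[-1]
    out = 0
    for i in (q, k, v):
        out = out + torch.softmax(i @ i.T / d ** 0.5, dim=-1) @ v
    return out
```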
The image is recognized as “building”, which is reasonable because buildings cover the largest area in the image.
The highly responsive regions are not limited to buildings; some roads and pavements are also activated, which indicates that the global bias contaminates the local patch tokens.
Simply subtract the global \(\left[\mathrm{CLS}\right]\) token:
\[\hat{\mathcal{O}} = \mathcal{O}\left[1 : hw + 1\right] - \lambda\;\mathcal{O}\left[0\right],\]
where \(\lambda\) is an intensity factor.
Tip
\(\mathcal{O}\left[1 : hw + 1\right]\) retrieves elements \(\boldsymbol{o}_1, \cdots, \boldsymbol{o}_{hw}.\)
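The corresponding operation is essentially a one-liner (a sketch; `O` is the projected output above, and the default value of the intensity factor `lam` is hypothetical):

```python
def remove_global_bias(O, lam=0.3):
    """O: (hw + 1, c) projected CLIP output [o_cls, o_1, ..., o_hw].
    Subtract the scaled global [CLS] token from every local patch token."""
    return O[1:] - lam * O[0]          # O_hat = O[1 : hw + 1] - lambda * O[0]
```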
Pre-trained backbone: CLIP (ViT-B/16)
Comparison with previous SOTAs
Note
Larger input images allow the upsampler to preserve more spatial information.
Plug-and-play
This study advances the application of open-vocabulary semantic segmentation methods, originally designed for natural images, to the domain of remote sensing by addressing the critical challenges outlined above. It adapts and integrates existing OVSS methodologies to handle remote sensing segmentation tasks effectively for the first time. The primary contributions of this work are as follows: