Motivation

NeRF (Neural Radiance Fields) [1] is a groundbreaking technique in computer vision and graphics that enables the synthesis of novel views of complex 3D scenes from a sparse set of 2D images. NeRF represents a scene as a continuous 5D function using a neural network. This function takes as input a 3D spatial location $(x, y, z)$ and a 2D viewing direction $(\theta, \varphi)$ and outputs the color (RGB) and volume density (opacity) at that point.

NeRF is based on the following simplified physical model:

  • Voxels (volume pixels, referred to as "particles" in NeRF's context) are scattered in 3D space (a cloud of light-emitting particles).
  • 3D objects are composed of particles that emit and absorb light.
  • Particles emit and absorb light, but do not scatter it.

The core idea of NeRF is to model the scene as follows (forward pass of the MLP):

$$F: (x, y, z, \theta, \varphi) \mapsto (R, G, B, \sigma).$$
  • voxel coordinates: $(x, y, z) \in \mathbb{R}^3$
  • yaw: $\theta \in [0, 2\pi)$
  • pitch: $\varphi \in [-\frac{\pi}{2}, \frac{\pi}{2}]$
  • roll: $\psi \in [0, 2\pi)$ (uncommonly used)
  • color: $(R, G, B)$
  • volume density (differential opacity): $\sigma \in [0, \infty)$

Yaw and pitch are defined as in the spherical coordinate system; together they describe the ray direction from the camera center to the voxel.
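As a concrete illustration, the two angles can be converted to the Cartesian unit vector that NeRF actually consumes. This is a minimal sketch; the axis convention here is an assumption, since conventions differ between implementations:

```python
import math

def ray_direction(theta, phi):
    """Map yaw theta and pitch phi to a Cartesian unit vector.

    Axis convention assumed here (conventions vary):
    theta rotates in the x-y plane, phi elevates toward +z.
    """
    return (
        math.cos(phi) * math.cos(theta),  # x
        math.cos(phi) * math.sin(theta),  # y
        math.sin(phi),                    # z
    )
```

The result always has unit norm, which is why NeRF can feed the MLP a 3D unit vector $\mathbf{d}$ instead of the two raw angles.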

Spherical Coordinate System

One can reconstruct a 3D scene (a high-resolution representation) from 2D images (low-resolution observations) based on multi-view consistency:

  1. Modeling: Model the 3D scene with $F(x, y, z, \theta, \varphi)$ and approximate it with an MLP. In practice, $(\theta, \varphi)$ is represented by a 3D unit vector $\mathbf{d}$.

  2. Volumetric Rendering: Render 2D images from different views by integrating $F$ along camera rays. This process is called ray marching.

    Given a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ , where $\mathbf{o}$ is the ray origin and $\mathbf{d}$ is the ray direction, the color of the ray is computed as:

    $$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \mathrm{d}t,$$

    where $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \mathrm{d}s\right)$ is the accumulated transmittance from $t_n$ to $t$, and $\sigma$ is the volume density.
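In practice this integral is evaluated by quadrature over discrete samples along the ray. A minimal sketch of the standard discretization, with per-segment opacity $\alpha_i = 1 - e^{-\sigma_i \delta_i}$ and spacing $\delta_i = t_{i+1} - t_i$:

```python
import math

def render_ray(sigmas, colors, deltas):
    """Quadrature of the volume rendering integral along one ray.

    sigmas: densities sigma_i at the sampled points
    colors: RGB tuples c_i at the sampled points
    deltas: spacings delta_i = t_{i+1} - t_i between samples
    Uses C ≈ sum_i T_i (1 - exp(-sigma_i delta_i)) c_i,
    with T_i = exp(-sum_{j<i} sigma_j delta_j).
    """
    C = [0.0, 0.0, 0.0]
    transmittance = 1.0  # T_1 = 1: nothing absorbed yet
    for sigma, c, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            C[k] += weight * c[k]
        transmittance *= math.exp(-sigma * delta)  # accumulate absorption
    return C
```

A fully opaque first sample returns that sample's color, while zero density everywhere returns black, matching the intuition behind $T(t)$.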

NeRF
3. Training: Train the MLP by minimizing the difference between rendered images and input images.

See also [2] to understand how NeRF is modeled.

Ray Tracing vs. Ray Marching

  • Ray Tracing: In traditional ray tracing, rays are traced from the camera into the scene, and intersections with surfaces are computed. When a ray hits a surface, the color is determined by the material properties and lighting at that point. This method is efficient for scenes with well-defined surfaces but struggles with volumetric effects like fog or smoke.
  • Ray Marching: Ray marching, by contrast, renders volumetric effects by incrementally stepping along the ray and sampling the scene at each step.
NeRF Multi-View Consistency

In 2D image processing, can one reconstruct high-resolution features from low-resolution features?

Let $x$ be an input image, $f$ a model backbone, and $t$ a basic transform that does not affect the backbone's feature extraction (e.g., flipping, cropping).

Assume we can learn a "good" upsampler $\sigma_{\uparrow}$ and a downsampler $\sigma_{\downarrow}$ that reproduces the low-resolution features:

$$f\left(t(x)\right) \approx t\left(f\left(x\right)\right)$$
$$f\left(t(x)\right) \approx \sigma_{\downarrow}\left(t\left(\sigma_{\uparrow}\left(f\left(x\right)\right)\right)\right)$$
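A toy sanity check of the first property. This is only a sketch: it uses a pointwise $1 \times 1$ convolution as a stand-in backbone, which commutes with a horizontal flip exactly, whereas real backbones satisfy the property only approximately:

```python
import torch

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, kernel_size=1, bias=False)  # stand-in backbone f
x = torch.randn(1, 3, 16, 16)                            # toy input image
t = lambda z: torch.flip(z, dims=[3])                    # horizontal flip

lhs = conv(t(x))  # f(t(x))
rhs = t(conv(x))  # t(f(x)) -- identical, since a 1x1 conv acts pointwise
```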

By "good", we mean that the features are invariant across the different "views" produced by jittering:

import torch
import torch.nn.functional as F

def apply_jitter(img, max_pad, transform_params):
    """Apply a random zoom / crop / flip "view" to a batch of images."""
    h, w = img.shape[2:]

    # Reflect-pad so that crops near the border stay valid.
    padded = F.pad(img, [max_pad] * 4, mode="reflect")

    zoom = transform_params["zoom"].item()
    x = transform_params["x"].item()
    y = transform_params["y"].item()
    flip = transform_params["flip"].item()

    # Optionally zoom in before cropping.
    if zoom > 1.0:
        zoomed = F.interpolate(padded, scale_factor=zoom, mode="bilinear")
    else:
        zoomed = padded

    # Crop an h x w window at offset (x, y).
    cropped = zoomed[:, :, x:h + x, y:w + y]

    # Optionally flip horizontally (along the width dimension).
    if flip:
        return torch.flip(cropped, [3])
    else:
        return cropped

Abstract

Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime.

However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas.

In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features.

We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution.

Both approaches use a multi-view consistency loss with deep analogies to NeRFs.

Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training.

We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.

Architecture

architecture
$$\mathcal{L}_{rec} = \frac{1}{|T|}\sum_{t \in T}\|f\left(t(x)\right) - \sigma_{\downarrow}\left(t(\sigma_{\uparrow}\left(f\left(x\right), x\right))\right)\|_2^2 + \log(s)$$

This pipeline does not require annotated data, and therefore it is “self-supervised”.
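The loss above can be sketched in PyTorch. This is a minimal sketch with placeholder names (`f`, `up`, `down`, `transforms` are not FeatUp's actual API), and the uncertainty term $\log(s)$ is omitted:

```python
import torch

def multiview_recon_loss(f, up, down, transforms, x):
    """Sketch of the multi-view reconstruction loss (uncertainty term omitted).

    f          -- frozen backbone: image -> low-res features
    up, down   -- learned upsampler / downsampler
    transforms -- list of jitter functions t (flip, crop, ...)
    x          -- input image batch
    """
    hr_feats = up(f(x), x)          # one shared high-res feature map
    loss = 0.0
    for t in transforms:
        target = f(t(x))            # backbone features of the jittered view
        pred = down(t(hr_feats))    # re-downsampled jittered high-res features
        loss = loss + (target - pred).pow(2).mean()
    return loss / len(transforms)
```

Only $x$ itself supervises the training, which is what makes the pipeline self-supervised.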

Methods

Downsampler

The downsampler approximates a network’s process of pooling information into features.

Blur kernel: efficient, but unable to capture nonlinear effects like dynamic receptive fields, object salience, etc.

Attention-based downsampler: $\sigma_{\downarrow}(F_{\mathrm{hr}})_{ij} = \mathrm{softmax}\left(w \odot \mathrm{Conv}\left(F_{\mathrm{hr}}\left[\Omega_{ij}\right]\right)+b\right) \cdot F_{\mathrm{hr}}\left[\Omega_{ij}\right]$

Simply put, "salience" corresponds to high-frequency content in the features.
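The attention pooling inside one window can be sketched as follows. This is a toy sketch: `scores` stands in for $w \odot \mathrm{Conv}(F_{\mathrm{hr}}[\Omega_{ij}]) + b$, which the real downsampler learns:

```python
import torch

def attention_pool_window(patch, scores):
    """Pool one window Omega_ij of high-res features with attention.

    patch:  (k, d) high-res feature vectors inside the window
    scores: (k,)   logits, standing in for w * Conv(F_hr[Omega_ij]) + b
    Returns one d-dimensional low-res feature for cell (i, j).
    """
    attn = torch.softmax(scores, dim=0)        # attention weights over the window
    return (attn.unsqueeze(1) * patch).sum(0)  # weighted sum of window features
```

With uniform scores this reduces to average pooling; learned scores let the downsampler emphasize salient pixels instead.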

downsampler

Salience

salience

Upsampler

FeatUp has two upsampler variants: JBU-based upsampler and implicit upsampler.

Joint Bilateral Upsampler (JBU)

One of the most important contributions of this work is a fast CUDA implementation of JBU with learned kernels, making it practical for real-world applications.

Traditional JBU [3]: ${\boldsymbol{S}_{\boldsymbol{p_{\uparrow}}}} = \frac{1}{k_\boldsymbol{p_{\uparrow}}}\sum_{\boldsymbol{q}_\downarrow \in \Omega}\boldsymbol{I}_{\boldsymbol{q}_\downarrow}\;f\left(\|\boldsymbol{p}_\downarrow - \boldsymbol{q}_\downarrow\|\right)\;g\left(\|{\boldsymbol{I}_\boldsymbol{p_{\downarrow}}} - {\boldsymbol{I}_\boldsymbol{q_{\downarrow}}}\|\right),$ where $f$ and $g$ are Gaussian filters (upsampling kernels).

These two upsampling kernels are computed from a window $\Omega$ in the guidance feature map $\boldsymbol{I}$ centered at pixel $\boldsymbol{p}$ :

JBU

Improved JBU: $\hat{F}_{\mathrm{hr}}[i,j]=\frac{1}{Z}\sum_{(a,b)\in\Omega}\left(F_{\mathrm{lr}}[a,b]k_{\mathrm{range}}\left(G[i,j], G[a,b]\right)k_{\mathrm{spatial}}\left([i,j],[a,b]\right)\right).$

Spatial filter kernel $f$ : $k_\mathrm{spatial}(\boldsymbol{p}, \boldsymbol{q}) = \exp\left(\frac{-\|\boldsymbol{p} - \boldsymbol{q}\|_2^2}{2\;\tau_\mathrm{spatial}^2}\right).$ More positionally distant points consistently contribute lower weights.

Range filter kernel $g$ : $k_\mathrm{range}(G[i, j], G[a, b]) = \mathrm{softmax}_{\left(a, b\right) \in \Omega}\left(\frac{1}{\tau_{\mathrm{range}}^2}\;\mathrm{MLP}\left(G\left[i, j\right]\right)\cdot\mathrm{MLP}\left(G\left[a, b\right]\right)\right).$ More semantically dissimilar points consistently contribute lower weights.
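The two kernels can be sketched in plain Python. This is a sketch of the formulas above; `sims` stands in for the learned dot products $\mathrm{MLP}(G[i,j]) \cdot \mathrm{MLP}(G[a,b])$ over the window:

```python
import math

def spatial_kernel(p, q, tau=1.0):
    """Gaussian spatial kernel k_spatial(p, q): decays with pixel distance."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * tau ** 2))

def range_kernel_weights(sims, tau=1.0):
    """Softmax range kernel over similarity scores within a window Omega.

    sims: one similarity score per window position (a, b).
    Returns weights that sum to 1 over the window.
    """
    exps = [math.exp(s / tau ** 2) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]
```

The temperatures $\tau_\mathrm{spatial}$ and $\tau_\mathrm{range}$ control how sharply each kernel falls off.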

$$\boldsymbol{F}_{\mathrm{hr}} = \left(\mathrm{JBU}\left(\cdot, \boldsymbol{x}\right) \circ \mathrm{JBU}\left(\cdot, \boldsymbol{x}\right) \circ \cdots\right)\left(f\left(\boldsymbol{x}\right)\right).$$
JBU Based Upsampler

Implicit

The implicit upsampler is a small MLP that takes in pixel coordinates and outputs features at that location. This architecture is a direct analogue to NeRF. Fourier features are used to improve the spatial resolution of our implicit representations.

implicit

However, directly using the raw coordinates as input to the MLP often leads to poor results, especially for high-frequency details.

To address this, NeRF employs Fourier feature mapping to transform the input coordinates into a higher-dimensional space before feeding them into the MLP. This transformation allows the network to better capture fine details and variations in the scene.

The NeRF paper shows that naïve MLPs cannot capture high-frequency details

The Functionality of Fourier Features

We all know that any function satisfying the Dirichlet conditions can be represented by a Fourier series:

Dirichlet Conditions
$$F(x) = A_0 + \sum_{n=1}^\infty A_n \sin(n \omega x + \varphi_n).$$
  • Each sinusoidal component has a specific frequency; that frequency identifies a "feature" of the function.
  • In the context of image processing, such a frequency describes how pixel values change with respect to spatial location.
Fourier Series

How can we utilize these components? We may construct a high-dimensional space where each dimension corresponds to a specific frequency component, then $F(x)$ can be vectorized as a point in this space. By computing the distance between points in this space, we can measure the similarity between $F(x)$ and other prototypes.

Theoretically, an ideal choice of such a space would be a Hilbert space $\mathcal{H}$, which has infinite dimensions to accommodate all possible frequencies present in the Fourier series. In practice, this cannot be done, but there is no need to do so either, since the success of NeRF shows that about 10 such feature dimensions are sufficient to greatly improve MLP performance.

Use Fourier Positional Features to Improve MLP’s Performance

[4] demonstrates that neural networks tend to learn low-frequency components first. This behavior closely mirrors how humans perceive features at different frequencies. For example, if you power a light bulb with a hand-cranked generator and turn the handle slowly, the bulb visibly blinks. As you turn the handle faster, the blinking becomes less noticeable, and the bulb appears to stay lit. When powered by 50Hz AC electricity, the bulb theoretically flickers 100 times per second, but the human eye cannot detect this rapid change. This is because our visual system is less sensitive to high-frequency variations.

Even though we don’t consciously perceive the flicker, our brain and eyes still respond to it. Some people experience discomfort or fatigue under flickering lights, especially with certain LED or fluorescent bulbs.

A workaround is to use Fourier features to explicitly encode high-frequency components in the input space, allowing the network to learn these details more effectively. Instead of learning $f(x)$ directly, we apply a mapping $\gamma(x)$ to $x$ first:

$$\gamma(x) = \left[\sin(2^0 \pi x), \cos(2^0 \pi x), \sin(2^1 \pi x), \cos(2^1 \pi x), \ldots, \sin(2^{L-1} \pi x), \cos(2^{L-1} \pi x)\right].$$

Referring to NeRF's implementation, I found that the Fourier feature mapping is slightly different: the $\pi$ is missing:

Copilot: This is a common simplification in code, as multiplying by $\pi$ is not strictly necessary for the encoding to work; the scaling factor can be absorbed into the network’s learned parameters.

This mapping projects $x$ into a $2L$ -dimensional space, where each dimension corresponds to a specific frequency component. Note that the high-frequency components $\sin(2 ^ {L-1} \pi x)$ and $\cos(2 ^ {L-1} \pi x)$ magnify small changes in $x$ , making it easier for the network to learn high-frequency details.
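The mapping $\gamma$ for a scalar coordinate can be sketched directly from the formula (a minimal sketch of the positional encoding, keeping the $\pi$ from the paper's formulation):

```python
import math

def gamma(x, L=10):
    """NeRF-style Fourier feature mapping of a scalar coordinate x.

    Projects x into 2L dimensions:
    [sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^(L-1) pi x), cos(2^(L-1) pi x)].
    """
    feats = []
    for l in range(L):
        freq = (2.0 ** l) * math.pi  # frequency doubles at each level
        feats.append(math.sin(freq * x))
        feats.append(math.cos(freq * x))
    return feats
```

The highest-frequency pair magnifies tiny changes in $x$ by a factor of $2^{L-1}\pi$, which is exactly what helps the MLP fit fine detail.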

Next, we learn a function $\mathcal{F}$ that maps these Fourier features to the desired output:

$$F(x) = \mathcal{F} \circ \gamma(x).$$

Then, we learn $\mathcal{F}$ with an MLP in the latent space.

This is very similar to residual learning!

Implicit Upsampler’s Implementation Details

Let $h$ be the component-wise discrete Fourier transform of an input signal $z$.

In the implicit upsampler, $\gamma(x) = \left[\sin(\mathrm{e}^{l_0} x), \cos(\mathrm{e}^{l_0} x), \sin(\mathrm{e}^{l_1} x), \cos(\mathrm{e}^{l_1} x), \ldots, \sin(\mathrm{e}^{l_n} x), \cos(\mathrm{e}^{l_n} x)\right]$ is used.

  • $u = [\gamma(0), \gamma(1), \ldots, \gamma(H-1)]$
  • $v = [\gamma(0), \gamma(1), \ldots, \gamma(W-1)]$

Let $\mathbin{:}$ be channel-wise concatenation, then the process of component-wise discrete Fourier feature mapping is:

$$h(u \mathbin{:} v \mathbin{:} z).$$

The features are then fed into an MLP to predict the features at that location.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleImplicitFeaturizer(torch.nn.Module):

    def __init__(self, n_freqs=20):
        super().__init__()
        self.n_freqs = n_freqs
        self.dim_multiplier = 2  # two coordinate channels: height and width

    def forward(self, original_image):
        b, c, h, w = original_image.shape
        grid_h = torch.linspace(-1, 1, h, device=original_image.device)  # project row coordinates onto [-1, 1]
        grid_w = torch.linspace(-1, 1, w, device=original_image.device)  # project column coordinates onto [-1, 1]
        # meshgrid acts like a Cartesian product: it yields two same-sized matrices
        # whose corresponding entries pair up into every possible coordinate pair
        feats = torch.cat([t.unsqueeze(0) for t in torch.meshgrid([grid_h, grid_w], indexing="ij")]).unsqueeze(0)
        feats = torch.broadcast_to(feats, (b, feats.shape[1], h, w)).unsqueeze(1)

        freqs = torch.exp(torch.linspace(-2, 10, self.n_freqs, device=original_image.device)) \
            .reshape(1, self.n_freqs, 1, 1, 1)  # construct the bank of frequencies
        feats = feats * freqs  # multiply each coordinate by every frequency

        feats = feats.reshape(b, self.n_freqs * self.dim_multiplier, h, w)

        all_feats = [torch.sin(feats), torch.cos(feats), original_image]

        # concatenate along the channel dimension, giving [sin, cos, pixel] per location
        return torch.cat(all_feats, dim=1)

class IFA(torch.nn.Module):
    def __init__(self, feat_dim, num_scales=20):
        super().__init__()
        self.scales = 2 * torch.exp(torch.arange(1, num_scales + 1).float())
        self.feat_dim = feat_dim
        self.sin_feats = SimpleImplicitFeaturizer()  # Fourier positional encoding
        self.mlp = nn.Sequential(
            nn.Conv2d(feat_dim + (num_scales * 4) + 2, feat_dim, 1),
            nn.BatchNorm2d(feat_dim),
            nn.LeakyReLU(),
            nn.Conv2d(feat_dim, feat_dim, 1),
        )

    def forward(self, source, guidance):
        b, c, h, w = source.shape
        up_source = F.interpolate(source, (h * 2, w * 2), mode="nearest")
        assert h == w
        lr_cord = torch.linspace(0, h, steps=h, device=source.device)
        hr_cord = torch.linspace(0, h, steps=2 * h, device=source.device)
        lr_coords = torch.cat([x.unsqueeze(0) for x in torch.meshgrid(lr_cord, lr_cord, indexing="ij")], dim=0).unsqueeze(0)
        hr_coords = torch.cat([x.unsqueeze(0) for x in torch.meshgrid(hr_cord, hr_cord, indexing="ij")], dim=0).unsqueeze(0)
        up_lr_coords = F.interpolate(lr_coords, (h * 2, w * 2), mode="nearest")
        coord_diff = up_lr_coords - hr_coords  # offset of each high-res pixel from its nearest low-res sample
        coord_diff_feats = self.sin_feats(coord_diff)  # Fourier positional encoding
        c2 = coord_diff_feats.shape[1]
        bcast_coord_feats = torch.broadcast_to(coord_diff_feats, (b, c2, h * 2, w * 2))
        return self.mlp(torch.cat([up_source, bcast_coord_feats], dim=1))  # feed [up_source : coord features] into the MLP

Experiments

Visualization

visualization
any backbone

Comparisons

comparisons

CAM: Class Activation Map

  • A.D.: Average Drop
  • A.I.: Average Increase

Intuitively, A.D. and A.I. capture how much an image’s most salient region changes the classification output.

For more details, refer to [5].

visualized comparisons

Improvement in Downstream Tasks

segmentation

The JBU-based upsampler improves feature quality with relatively few parameters and low computational cost.

Limitations

The JBU-based upsampler is not very sensitive to fine-grained details.

limitations

Such limitations are related to the receptive field of the JBU’s spatial kernel. Increasing the kernel size can help mitigate these issues, but it also increases computational cost. [6]

Ablation Studies

These ablation studies in the supplementary material were included in response to reviewer feedback.

  • MLP
  • Replacing the softmax with a Gaussian kernel w.r.t. Euclidean distance (the original JBU formulation)
  • Replacing the softmax with a Gaussian kernel w.r.t. cosine distance (a modification of JBU)
JBU-based upsampler ablation
  • Downsampler (simple vs. attention)
  • Presence of outlier detection in loss function
  • TV regularization
implicit upsampler Ablation

Summary

Contributions in This Paper

  • A model- and task-agnostic framework to upsample deep features to any resolution.
  • A fast CUDA implementation of JBU with learned kernels.
  • Drop-in replacements for ordinary features to improve performance on dense prediction tasks and model explainability.

Some Takeaways

  • Got to know the 3D reconstruction direction in computer vision, followed some interesting related work, and gained a deeper understanding of some common techniques (e.g., the Fourier transform).
  • Some 3D reconstruction techniques can be transferred to 2D image processing; I plan to read related work such as Pri3d: Can 3d priors help 2d representation learning? [7]
  • The method itself is not the key; what matters is grasping why a given method works well, ideally understanding the root cause of the good results at a theoretical level. Otherwise it degenerates into a "alchemy experience exchange".
  • When reading math textbooks and theory-oriented papers from venues like NeurIPS and ICML, I am often lost: I learn a theory without seeing its deeper real-world significance. When reading application-oriented work from top conferences, it pays to look closely at the theoretical work they cite from other top venues; doing so can be illuminating.

References

[1]
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” in Communications of the ACM, ACM New York, NY, USA, 2021, pp. 99–106.
[2]
A. Tagliasacchi and B. Mildenhall, “Volume Rendering Digest (for NeRF),” arXiv preprint arXiv:2209.02417, 2022.
[3]
J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Transactions on Graphics (ToG), vol. 26, no. 3, pp. 96-es, 2007.
[4]
N. Rahaman et al., “On The Spectral Bias of Neural Networks,” in International Conference on Machine Learning, PMLR, 2019, pp. 5301–5310.
[5]
S. Poppi, M. Cornia, L. Baraldi, and R. Cucchiara, “Revisiting the evaluation of class activation mapping for explainability: A novel metric and experimental analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2299–2304.
[6]
K. Li et al., “SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10545–10556.
[7]
J. Hou, S. Xie, B. Graham, A. Dai, and M. Nießner, “Pri3d: Can 3d priors help 2d representation learning?,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5693–5702.