Background
- Image-level labels indicating only the presence of objects
- Class Activation Map (CAM)
- Co-occurrence
Abstract
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve segmentation tasks without dense annotations. However, attributed to the frequent coupling of co-occurring objects and the limited supervision from image-level labels, the challenging co-occurrence problem is widely present and leads to false activation of objects in WSSS. In this work, we devise a ‘Separate and Conquer’ scheme SeCo to tackle this issue from dimensions of image space and feature space. In the image space, we propose to ‘separate’ the co-occurring objects with image decomposition by subdividing images into patches. Importantly, we assign each patch a category tag from Class Activation Maps (CAMs), which spatially helps remove the co-context bias and guide the subsequent representation. In the feature space, we propose to ‘conquer’ the false activation by enhancing semantic representation with multi-granularity knowledge contrast. To this end, a dual-teacher-single-student architecture is designed and tag-guided contrast is conducted, which guarantee the correctness of knowledge and further facilitate the discrepancy among co-contexts. We streamline the multi-staged WSSS pipeline end-to-end and tackle this issue without external supervision. Extensive experiments are conducted, validating the efficiency of our method and the superiority over previous single-staged and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available here.
Motivation
Co-occurrence of objects is inevitable, always leading to false positive pixels activated with high probability, i.e., confusing the model by error-prone feature representation. To deal with such issue, a common practice is to introduce external supervision or human prior.
So, why not separate the coupled objects first to generate patches at the beginning?
Each patch contains single-category information, followed by enhancing category-specific representation with a dual-teacher single-student architecture.
Overall Architecture
Methods
Given input image $\boldsymbol{I}$ containing $K$ classes of objects $\left\{Y_i\right\}\left(i = 1, 2, \cdots, K\right)$ .
decomposition
- Train an initial classification model to generate CAM seeds (auxiliary classification head in the teacher network);
- Decompose $\boldsymbol{I}$ to patch $\left\{\boldsymbol{x}_i\right\}\left(i = 1, 2, \cdots, n\right) = \mathrm{crop}\left(I\right)$ ;
- Assign CAM’s tag $t_i$ to $\boldsymbol{x}_i$ .
In the teacher’s $\lambda$ <sup>th</sup> layer:
- $\boldsymbol{W}_\lambda$ is the mapping matrix in the encoder;
- $\boldsymbol{Z}_\mathrm{F}^\lambda$ is the feature map.
Construct auxiliary pseudo mask $\boldsymbol{M}_\mathrm{aux}$ by:
Use $\mathrm{CAM}_\mathrm{aux}$ to guide category tag allocation:
let $\boldsymbol{m}_i = \mathrm{crop}\left(\boldsymbol{M}_\mathrm{aux}\right)$ , and assign $\boldsymbol{m}_i$ to $\boldsymbol{x}_i$ .
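The decomposition and tag-assignment steps above can be sketched as follows. This is a minimal illustration; the patch size and the majority-vote tag rule are assumptions, not the paper's exact procedure:

```python
import numpy as np

def decompose(image, mask, patch=2):
    """Split an image and its pseudo mask M_aux into non-overlapping patches,
    assigning each patch x_i the dominant class in its mask crop m_i as tag t_i.
    Patch size and the majority-vote rule are illustrative assumptions."""
    H, W = mask.shape
    patches, tags = [], []
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            m = mask[y:y+patch, x:x+patch]          # m_i = crop(M_aux)
            patches.append(image[y:y+patch, x:x+patch])
            cls, counts = np.unique(m, return_counts=True)
            tags.append(int(cls[np.argmax(counts)]))  # dominant class as tag t_i
    return patches, tags
```

Because each patch receives a single category tag, the co-occurring objects of an image end up in separate, individually labeled patches.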
Methods
decomposition
For representation of local patches, ViT is used to extract high-level semantics.
- Weak data augmentation for student: $\mathrm{Aug}_\mathrm{q}\left(\boldsymbol{x}\right)$ ;
- Strong data augmentation for local teacher: $\mathrm{Aug}_\mathrm{k}\left(\boldsymbol{x}\right)$ .
$g_\mathrm{q}$ is the student encoder, $g_\mathrm{k}$ the local teacher encoder. The class token in ViT is used to extract high-level semantics, followed by an MLP to strengthen features.
Decomposition spatially separates co-occurrences, but may destroy the semantic context of patches.
Methods
representation
- Use a global teacher to extract knowledge from the entire image.
- Share the encoder between the global teacher and the student.
- Instead of extracting semantics based on CAMs, the global teacher uses class tokens to represent high-level semantics and obtains knowledge $\boldsymbol{P}_l\left(l = 1, 2, \cdots, K\right)$ (i.e., class semantic centroids, also mentioned in the introduction, which help push apart co-contexts), avoiding the noise from false localization of CAMs.
The self-attention mechanism gathers global semantics, avoiding the limitation of CAM-based methods when applied globally (easily confused by co-occurrences). Recall that the goal of “decomposition” is to “decouple” co-occurrences.
- Adaptive updating strategy, to gather semantics across the dataset.
Note that ViT is used to gather semantics in one specific image.
Methods
representation
adaptive updating
Example: image A has a prototype $a$ for class “boat”, and image B has another prototype $b$ for class “boat”; the cosine similarity between $a$ and $b$ is computed, and a softmax is then applied to these similarity scores to obtain the weights $W_l$.
Prototypes for the same class from different images will contribute to the same global prototype.
Given the multi-class tokens $\boldsymbol{Z}_l$ obtained from the global teacher encoder, the updating process:
An exponential moving average is applied to the knowledge and the weighted token (for dynamic updating).
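A minimal numpy sketch of this adaptive update for one class. The similarity-to-prototype weighting and the momentum value are assumptions used for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_prototype(P_l, tokens, momentum=0.9):
    """Adaptive EMA update of the global prototype P_l for one class.
    tokens: class tokens Z_l for class l gathered from different images.
    Similarity-weighted averaging and the momentum are illustrative assumptions."""
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = tokens @ (P_l / np.linalg.norm(P_l))   # cosine similarity to P_l
    w = softmax(sims)                             # weights W_l
    z = (w[:, None] * tokens).sum(0)              # similarity-weighted token
    return momentum * P_l + (1 - momentum) * z    # exponential moving average
```

With this rule, prototypes of the same class from different images all contribute to one global prototype, weighted by how consistent they are with it.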
Methods
utilize positive prototypes
An image is cropped into $u$ patches. $\boldsymbol{q}_i$ is the local feature extracted by student. $\boldsymbol{P}_l^+$ is the positive prototype belonging to the same category with $\boldsymbol{q}_i$ .
Force $\boldsymbol{q}_i$ to be close to its corresponding global prototype (centroid).
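The pull toward the positive prototype can be sketched as a cosine-alignment loss; the exact loss form used in the paper may differ:

```python
import numpy as np

def positive_loss(q, P, tags):
    """Pull each patch feature q_i toward its positive prototype P^+ (the
    global centroid of its own class). Cosine-distance form is an assumption.
    q: (u, d) local features from the student; P: (K, d) global prototypes;
    tags: (u,) category tag of each patch."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    cos = np.sum(q * P[tags], axis=1)   # similarity to positive prototype P^+
    return float(np.mean(1.0 - cos))    # 0 when perfectly aligned
```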
Methods
merely utilizing positive prototypes is not enough
A category tag pool is proposed to match the memory bank.
Use a queue to capture chronological information: enqueue the newest $B_\mathrm{q}, B_\mathrm{k}$ and dequeue the oldest $B_\mathrm{-q}, B_\mathrm{-k}$. Update the local teacher from the student with EMA to keep the memories consistent for contrast and to avoid dramatic variance between the oldest and newest memories in the reservoir.
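A minimal sketch of the tag-matched FIFO reservoir and the EMA teacher update; the capacity and momentum are placeholder values:

```python
from collections import deque

class TagMemoryBank:
    """FIFO reservoir of local-teacher embeddings with a parallel tag pool.
    Enqueuing the newest entry automatically dequeues the oldest once the
    capacity is exceeded. Capacity is an illustrative placeholder."""
    def __init__(self, capacity=4):
        self.keys = deque(maxlen=capacity)
        self.tags = deque(maxlen=capacity)

    def enqueue(self, key, tag):
        self.keys.append(key)   # deque drops the oldest entry automatically
        self.tags.append(tag)   # tag pool stays aligned with the memory bank

def ema_update(teacher, student, m=0.99):
    """EMA update of local-teacher parameters from the student, keeping
    memories consistent for contrast. m is a placeholder momentum."""
    return [m * t + (1 - m) * s for t, s in zip(teacher, student)]
```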
Methods
rectify noisy tags
A similarity-based rectification strategy to denoise the tags.
Measure the similarity between $\boldsymbol{q}_i$ and its historical embeddings.
The similarity between two patches of the same category should be significantly higher than that between patches of different categories.
Once the proportion of abnormal similarity pairs exceeds a threshold $\sigma$, we consider $\boldsymbol{q}_i$ a noisy embedding.
If $\frac{1}{\left|R\left(\boldsymbol{x}, t_i\right)_+\right|}\sum_{\boldsymbol{k}_+ \in R\left(\boldsymbol{x}, t_i\right)_+} \mathbb{1}\left(\boldsymbol{q}_i^\mathsf{T}\boldsymbol{k}_+ \lt \mu\left(\boldsymbol{q}_i, t_i\right)\right) \gt \sigma$, then $t_i \leftarrow -1$.
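The rectification rule can be sketched directly from this formula; the threshold values here are placeholders:

```python
import numpy as np

def rectify_tag(q, positives, mu, sigma=0.5):
    """Similarity-based tag rectification: count positive pairs whose
    similarity q_i^T k_+ falls below the threshold mu; if the fraction of
    such abnormal pairs exceeds sigma, the tag is judged noisy (t_i <- -1).
    Returns True if the embedding q is noisy. mu/sigma are placeholders."""
    sims = positives @ q              # q_i^T k_+ for all same-tag keys
    abnormal = np.mean(sims < mu)     # fraction of abnormal pairs
    return bool(abnormal > sigma)
```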
Methods
tag guided contrast
Recall that each patch has its category tag (noisy tags already rectified).
Patch-level co-category differentiation:
- $M_f$ is rectification mask to exclude noisy patches;
- $n$ is the number of patches.
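A hedged numpy sketch of the tag-guided contrast for a single patch feature, with the rectification mask $M_f$ dropping noisy keys. The temperature and the supervised-contrast loss form are assumptions:

```python
import numpy as np

def tag_guided_contrast(q, keys, q_tag, key_tags, valid, tau=0.1):
    """Tag-guided contrast for one patch feature q: memory-bank keys sharing
    q's tag are positives, keys with other tags are negatives, and `valid`
    is the rectification mask M_f excluding noisy patches.
    The InfoNCE form and temperature tau are illustrative assumptions."""
    keys, key_tags = keys[valid], key_tags[valid]  # apply rectification mask
    logits = keys @ q / tau
    exp = np.exp(logits - logits.max())
    pos = key_tags == q_tag
    # average over positives of -log softmax, as in supervised contrast
    return float(np.mean(-np.log(exp[pos] / exp.sum())))
```

Minimizing this pulls same-tag patches together and pushes co-occurring categories apart, which is the intended co-category differentiation.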
Methods
training objectives
Loss for SeCo:
For overall loss, add segmentation loss:
Experiments
comparison with SOTAs
Experiments
ablation
Experiments
comparison with other recent methods
Experiments
performance
Summary
contribution
- decomposition: patches with category tags;
- representation: combine global and local semantics together;
- two loss functions and a tag rectification method.
limitation
Patches are of fixed size; an adaptive patch-sizing strategy might do better.