
🔎 CLIP Surgery: A Closer Look at the Explainability of Contrastive Language-Image Pre-training


🔎 CLIP Surgery: Enhancing Explainability by Operating on CLIP!

Image

  • Paper: A Closer Look at the Explainability of Contrastive Language-Image Pre-training (CLIP Surgery)
  • Venue: Pattern Recognition (2025)
  • Code: GitHub – CLIP Surgery
  • Keywords: CLIP, Explainability, CAM, Vision-Language, Open-Vocabulary
  • Summary: CLIP is a powerful vision-language model, but it tends to focus on the background instead of the foreground and suffers from noisy activation. To address this, the paper applies Architecture Surgery and Feature Surgery, greatly improving explainability without any extra training.


🚀 Key Summary of CLIP Surgery

One-liner: "Without extra training, structural and feature surgeries alone enhance CLIP's explainability!"

1) Proving CLIP's Issues: found the problems of inconsistent self-attention and redundant features.

2) Feature Surgery: Successfully removed redundant features and suppressed unnecessary noisy activation → produced clean CAMs!!

3) Training-Free Explainability: No fine-tuning required; explainability is secured with the original CLIP → versatile applications!!


๐Ÿ” Research Background

  • CAM, Grad-CAM: Effective for CNNs/ViTs but fail on CLIP.

    On CLIP they are noisy and produce opposite visualizations (highlighting the background instead of the object). In other words, localization fails!!
    Image

  • Why do they fail in CLIP?
    • Because self-attention links inconsistent semantic regions, and redundant features emphasize background rather than foreground.

    a. Why does self-attention link inconsistent semantic regions?
    a-1. CLIP was only trained for global image–text matching, so attention didn't need to focus precisely on object interiors.
    a-2. CLIP's Query, Key, and Value parameters are different (heterologous), so Q/K relations connect inconsistent semantic areas (see the toy sketch at the end of this section).

    A_raw = σ(s · QK^⊤)V : heterologous parameters
    A_con = σ(s · VV^⊤)V : homogeneous parameters

    b. Why do redundant features cause noise?
    b-1. CLIP trains on many categories at once, so shared features (e.g., "sky", "grass", "road") frequently appear.
    b-2. These generic features are often in the background, so self-attention is easily pulled toward them, causing noisy activation.
    Image

  • Alignment-based approaches exist but require extra models, layers, or fine-tuning (not training-free).
    • ECLIP: Realigns CLIP features with segmentation masks using self-supervision (since the original CLIP cannot localize on its own, mask information is learned to compensate).
    • RCLIP: Uses bounding box annotations to refine CLIPโ€™s imageโ€“text features per object.
    • Both require retraining (fine-tuning).
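To make the heterologous-vs-homogeneous point concrete, here is a toy sketch (not the authors' code) with random projection matrices standing in for CLIP's trained W_q, W_k, W_v: the V·V^⊤ score matrix is symmetric, so two tokens always attend to each other by the same amount, while Q·K^⊤ has no such constraint and can link unrelated regions.

```python
# Toy comparison of raw (QK^T) vs. consistent (VV^T) attention; all weights are random stand-ins.
import torch

torch.manual_seed(0)
N, C = 6, 16                                            # tokens, channels
x = torch.randn(1, N, C)
W_q, W_k, W_v = (torch.randn(C, C) for _ in range(3))   # hypothetical heterologous projections
q, k, v = x @ W_q, x @ W_k, x @ W_v
s = C ** -0.5

A_raw = torch.softmax(s * (q @ k.transpose(-2, -1)), dim=-1)  # A_raw = softmax(s * QK^T)
A_con = torch.softmax(s * (v @ v.transpose(-2, -1)), dim=-1)  # A_con = softmax(s * VV^T)

# VV^T is symmetric (token i scores token j exactly as j scores i); QK^T generally is not.
print(torch.allclose(v @ v.transpose(-2, -1), (v @ v.transpose(-2, -1)).transpose(-2, -1)))  # True
print(torch.allclose(q @ k.transpose(-2, -1), (q @ k.transpose(-2, -1)).transpose(-2, -1)))  # False
```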

🧱 CLIP Surgery Architecture

Image

i) Architecture Surgery (Fixing Structural Issues)

  • Raw self-attention (i-1) connects inconsistent semantic regions.
  • Consistent self-attention (i-2) prevents unnecessary background emphasis.

mFSR measures how much self-attention focuses on the foreground (object).
Image
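The exact mFSR definition is given in the paper (figure above); purely as an illustration of the idea, a foreground-saliency-style ratio over an attention map could be computed as below, assuming a binary token-level foreground mask (this is not the paper's exact metric).

```python
# Illustrative only: how much attention mass lands on foreground vs. background tokens.
import torch

def foreground_saliency_ratio(attn: torch.Tensor, fg_mask: torch.Tensor) -> float:
    """attn: (N, N) row-normalized attention map; fg_mask: (N,) bool, True = foreground token."""
    fg = attn[:, fg_mask].sum(dim=-1).mean()   # average mass sent to foreground tokens
    bg = attn[:, ~fg_mask].sum(dim=-1).mean()  # average mass sent to background tokens
    return (fg / bg).item()

attn = torch.softmax(torch.randn(6, 6), dim=-1)                    # dummy attention map
fg_mask = torch.tensor([True, True, False, False, False, False])   # first two tokens = object
print(foreground_saliency_ratio(attn, fg_mask))                    # > 1 means foreground-focused
```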

i-1) raw self-attention : A_raw = σ(s · QK^⊤)V
i-2) consistent self-attention : A_con = σ(s · VV^⊤)V

  • Code example: Transformer Attention forward pass
```python
# i-1) Raw Self-attention
def forward(self, x):
    B, N, C = x.shape
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]

    # original self-attention for the original path
    attn_ori = (q @ k.transpose(-2, -1)) * self.scale
    attn_ori = attn_ori.softmax(dim=-1)
    attn_ori = self.attn_drop(attn_ori)

    # i-2) consistent Self-attention: replace q & k by v
    k = v
    q = k
    attn = (q @ k.transpose(-2, -1)) * self.scale
    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    # return both paths
    x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
    x = (attn @ v).transpose(1, 2).reshape(B, N, C)  # clip_surgery
    # x = v.transpose(1, 2).reshape(B, N, C)  # mask_clip variant
    x = self.proj_drop(self.proj(x))
    x_ori = self.proj_drop(self.proj(x_ori))
    return [x, x_ori]
```
  • Dual paths: one path preserves the original CLIP embeddings, the other produces the CAM; the FFN is skipped on the new path to avoid its negative effects.

Image

  • In the figure, the FFN inside the CLIP Transformer attends to the wrong regions, and the early self-attention blocks are also inaccurate.
  • Thus, in the new path, only self-attention is applied (FFN skipped).
  • The original path is kept to preserve embeddings.

  • Code example: ResidualAttentionBlock forward
```python
def forward(self, x):
    # dual paths for blocks deeper than "d"
    if isinstance(self.attn, Attention):
        if isinstance(x, list):
            x, x_ori = x  # x_ori = original path, x = new path
            x_res = self.attention(self.ln_1(x_ori))
            x_res, x_ori_res = x_res  # consistent vs. raw self-attention outputs
            x_ori += x_ori_res  # original path adds raw self-attention
            x_ori = x_ori + self.mlp(self.ln_2(x_ori))  # original path also goes through the FFN
            x += x_res  # new path only adds consistent self-attention (FFN skipped)
            return [x, x_ori]
    # (shallower blocks and single-path inputs follow the standard forward, omitted here)
```

ii) Feature Surgery (Fixing Representational Issues)

  • CLIP learns many categories at once, leading to redundant shared features.
  • Example: even when the target is "dog", the embeddings of other texts such as cat, sky, sea, and airplane overlap with it.

    A small L1 distance means the positive (target) and empty (irrelevant) class features overlap → the redundancy problem!!
    Image

  • Code: clip_feature_surgery builds redundant_feats from the parts shared across all classes and subtracts them from each class's features.
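The demo below assumes a CLIP Surgery model and an input image are already loaded. A minimal setup sketch, assuming the fork keeps the usual clip.load interface and the "CS-ViT-B/16" model name from the repository's demo (the image path is a placeholder):

```python
# Assumed setup for the demo snippet below; adjust names/paths to the actual repository.
import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch
from PIL import Image
import clip  # the CLIP Surgery fork of the clip package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("CS-ViT-B/16", device=device)   # architecture-surgery variant
model.eval()

pil_img = Image.open("example.jpg").convert("RGB")            # placeholder image path
cv2_img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2BGR)  # uint8 BGR image for visualization
image = preprocess(pil_img).unsqueeze(0).to(device)           # [1, 3, H, W] tensor for the model
```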
```python
# in demo.py: all_texts covers every candidate class (PascalContext-style list),
# target_texts is the class to visualize
all_texts = ['airplane', 'bag', 'bed', 'bedclothes', 'bench', 'bicycle', 'bird', 'boat', 'book',
             'bottle', 'building', 'bus', 'cabinet', 'car', 'cat', 'ceiling', 'chair', 'cloth',
             'computer', 'cow', 'cup', 'curtain', 'dog', 'door', 'fence', 'floor', 'flower',
             'food', 'grass', 'ground', 'horse', 'keyboard', 'light', 'motorbike', 'mountain',
             'mouse', 'person', 'plate', 'platform', 'potted plant', 'road', 'rock', 'sheep',
             'shelves', 'sidewalk', 'sign', 'sky', 'snow', 'sofa', 'table', 'track', 'train',
             'tree', 'truck', 'tv monitor', 'wall', 'water', 'window', 'wood']
target_texts = ['dog']

with torch.no_grad():
    # extract and normalize image token features
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # prompt ensemble for text features, with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)

    # feature surgery, then a similarity map from the image tokens (B, H, W, N)
    similarity = clip.clip_feature_surgery(image_features, text_features)
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])

    # overlay the map for the target class on the image
    for b in range(similarity_map.shape[0]):
        for n in range(similarity_map.shape[-1]):
            if all_texts[n] not in target_texts:
                continue
            vis = (similarity_map[b, :, :, n].cpu().numpy() * 255).astype('uint8')
            vis = cv2.applyColorMap(vis, cv2.COLORMAP_JET)
            vis = cv2_img * 0.4 + vis * 0.6
            vis = cv2.cvtColor(vis.astype('uint8'), cv2.COLOR_BGR2RGB)
            print('CLIP:', all_texts[n])
            plt.imshow(vis)
            plt.show()
```
```python
# clip.py: clip_feature_surgery builds redundant_feats shared across all classes
# and subtracts it from every class's features
def clip_feature_surgery(image_features, text_features, redundant_feats=None, t=2):

    if redundant_feats is not None:
        similarity = image_features @ (text_features - redundant_feats).t()

    else:
        # weights to restrain the influence of obvious classes on others
        prob = image_features[:, :1, :] @ text_features.t()
        prob = (prob * 2).softmax(-1)
        w = prob / prob.mean(-1, keepdim=True)

        # element-wise multiplied features
        b, n_t, n_i, c = image_features.shape[0], text_features.shape[0], image_features.shape[1], image_features.shape[2]
        feats = image_features.reshape(b, n_i, 1, c) * text_features.reshape(1, 1, n_t, c)
        feats *= w.reshape(1, 1, n_t, 1)
        redundant_feats = feats.mean(2, keepdim=True)  # mean along the class dim
        feats = feats - redundant_feats

        # sum the element-wise multiplied features as cosine similarity
        similarity = feats.sum(-1)

    return similarity
```
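As a quick sanity check of the shapes (no real model needed), the function above can be called with random stand-ins: image tokens of shape [B, 1+N, C] (a CLS token plus N patch tokens) and T normalized text embeddings of shape [T, C].

```python
# Shape check for clip_feature_surgery with random, normalized stand-in features.
import torch

B, N, C, T = 1, 196, 512, 59                      # batch, patch tokens, channels, text classes
image_features = torch.randn(B, 1 + N, C)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = torch.randn(T, C)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarity = clip_feature_surgery(image_features, text_features)
print(similarity.shape)  # torch.Size([1, 197, 59]): per-token similarity to each class
```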

🧪 Experimental Results

🎯 Explainability Benchmarks

  • On VOC 2012, COCO, PascalContext:
    • mIoU +22–36% improvement
    • mSC +48–65% improvement

🎯 Open-Vocabulary Tasks

  • Semantic Segmentation: best training-free method (PascalContext mIoU 29.3%)
  • Multi-label Recognition: +11.61% mAP over CLIP on NUS-Wide
  • Interactive Segmentation: replace manual labels by converting text → points for SAM (see the sketch after this list)
  • Multimodal Visualization: interpret CLIP's training itself
    Image
    • [end] token most often activated; non-object words like "in", ".", "of" also highly active!!
    • Suggests redundant tokens in CLIP's vocabulary.
    • Provides ideas for improving CLIP training in the future.
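For the interactive-segmentation use referenced above, here is a hedged sketch of the text-to-points idea: the cam_to_point_prompts helper is illustrative (not the repo's own conversion routine); only the segment-anything calls are the library's real API, and similarity_map, all_texts, cv2_img are assumed to come from the demo earlier.

```python
# Turn the CLIP Surgery similarity map for the target text into point prompts for SAM.
import cv2
import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry

def cam_to_point_prompts(sim_map: torch.Tensor, num_points: int = 5) -> np.ndarray:
    """Pick the num_points highest-response pixels of an (H, W) map as (x, y) point prompts."""
    h, w = sim_map.shape
    idx = torch.topk(sim_map.flatten(), num_points).indices
    ys = torch.div(idx, w, rounding_mode='floor')
    xs = idx % w
    return torch.stack([xs, ys], dim=-1).cpu().numpy()

n = all_texts.index('dog')                                             # target class index
points = cam_to_point_prompts(similarity_map[0, :, :, n])
labels = np.ones(len(points), dtype=np.int64)                          # 1 = foreground point

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")   # SAM ViT-H checkpoint path
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB))          # SAM expects RGB uint8
masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels)
```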

👀 Qualitative Comparison

Image

  • Original CLIP: emphasizes background + noisy
  • CLIP Surgery: sharp, object-focused heatmaps
  • → Clear improvement vs Grad-CAM, Bi-Modal, gScoreCAM

🧪 Ablation Study

Image

  • Only Architecture Surgery (i) → mSC +47.88%
  • Add Feature Surgery (ii) → extra +3.17%
  • Without dual paths → collapse occurs, proving the dual-path design is essential.

✅ Conclusion

  • CLIP Surgery solves CLIP's fundamental explainability issues (opposite visualization, noisy activation).
  • A training-free approach strengthens CAM-based interpretation.
  • Directly applicable to downstream tasks like Semantic Segmentation, Multi-label Recognition, Interactive Segmentation.
  • Provides key insights for understanding CLIP internals and guiding future model improvements.

This post is licensed under CC BY 4.0 by the author.