
🔎 ClipSurgery Hands-on: Let's try out ClipSurgery!

🧬 (English) Practicing ClipSurgery!!

Today we will practice with the CLIP model that has been surgically modified - ClipSurgery!

We'll run it directly and generate a clean similarity map (CAM) on top of an image!!


✅ Environment Setup!!

  • First, clone the Clip Surgery Git repo:
    
    git clone https://github.com/xmed-lab/CLIP_Surgery.git
    
  • This includes clip.py, clip_surgery_model.py, and also demo.ipynb.
  • We will customize this demo.ipynb.

  • Additionally:
  • Python ≥ 3.9; CUDA recommended (CPU also works)
  • Required libraries: torch, opencv-python, numpy, Pillow, matplotlib, torchvision (a one-line install sketch follows this list)
  • From the CLIP_Surgery repo (or equivalent module) we need:
    • clip.load("CS-ViT-B/16", ...) (the surgically modified vision backbone)
    • encode_text_with_prompt_ensemble
    • clip_feature_surgery
    • get_similarity_map
  • One image file for visualization (e.g. dog.jpg)
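
If you are starting from a clean environment, the libraries listed above can be installed in one go. This is only a convenience sketch (the post does not pin versions); pick the torch/torchvision build that matches your CUDA setup.

    pip install torch torchvision opencv-python numpy Pillow matplotlib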

🧠 What will we do?

  1. Extract image token features with the CS-ViT-B/16 model (with Architecture Surgery),
  2. Build stable class (text) embeddings via Prompt Ensemble,
  3. Apply Feature Surgery to remove redundant/common class features,
  4. Generate and visualize a foreground-focused similarity map.

🧪 Let's Start!! - Full Code

import clip
import torch
import cv2
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt
from torchvision.transforms import Compose, Resize, ToTensor, Normalize
from torchvision.transforms import InterpolationMode
BICUBIC = InterpolationMode.BICUBIC
# from segment_anything import sam_model_registry, SamPredictor

# 0) Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) (Optional) Load original CLIP - for comparison
model, _ = clip.load("ViT-B/16", device=device)
model.eval()

# 2) Preprocessing pipeline
preprocess = Compose([
    Resize((224, 224), interpolation=BICUBIC),
    ToTensor(),
    Normalize((0.48145466, 0.4578275, 0.40821073),
              (0.26862954, 0.26130258, 0.27577711))
])

# 3) Load input image (set my_path to the directory that contains dog.jpg)
pil_img = Image.open(f"/{my_path}/dog.jpg")
cv2_img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2BGR)
image = preprocess(pil_img).unsqueeze(0).to(device)

# 4) Class dictionary - Feature Surgery requires multiple classes
all_texts = [
    'airplane', 'bag', 'bed', 'bedclothes', 'bench', 'bicycle', 'bird', 'boat',
    'book', 'bottle', 'building', 'bus', 'cabinet', 'car', 'cat', 'ceiling',
    'chair', 'cloth', 'computer', 'cow', 'cup', 'curtain', 'dog', 'door',
    'fence', 'floor', 'flower', 'food', 'grass', 'ground', 'horse', 'keyboard',
    'light', 'motorbike', 'mountain', 'mouse', 'person', 'plate', 'platform',
    'potted plant', 'road', 'rock', 'sheep', 'shelves', 'sidewalk', 'sign',
    'sky', 'snow', 'sofa', 'table', 'track', 'train', 'tree', 'truck',
    'tv monitor', 'wall', 'water', 'window', 'wood'
]
target_texts = ['dog']

# 5) Load surgically modified architecture (CS-ViT-B/16)
model, preprocess_unused = clip.load("CS-ViT-B/16", device=device)
model.eval()

with torch.no_grad():
    # (A) Image features (per token, including CLS)
    image_features = model.encode_image(image)                    # [B, 1+HW, C]
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # (B) Prompt ensemble-based text features
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)  # [N, C]

    # (C) Feature Surgery - remove redundant features
    similarity = clip.clip_feature_surgery(image_features, text_features)            # [B, 1+HW, N]

    # (D) Generate similarity map from patch tokens only + upsample
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])  # [B, H, W, N]

    # (E) Visualization - overlay target class only
    for b in range(similarity_map.shape[0]):
        for n in range(similarity_map.shape[-1]):
            if all_texts[n] not in target_texts:
                continue
            vis = (similarity_map[b, :, :, n].cpu().numpy() * 255).astype('uint8')
            vis = cv2.applyColorMap(vis, cv2.COLORMAP_JET)
            vis = cv2_img * 0.4 + vis * 0.6
            vis = cv2.cvtColor(vis.astype('uint8'), cv2.COLOR_BGR2RGB)
            print('CLIP Surgery:', all_texts[n])
            plt.imshow(vis)
            plt.axis('off')
            plt.show()

Running this code immediately segments the dog quite well!

Image


🔍 Code Explanation (Key Points)

1) CS-ViT-B/16: Backbone with Architecture Surgery

  • In several final blocks, set q=k=v(=V) to perform Consistent Self-Attention (a rough sketch follows this list).
  • Introduce a Dual Path:
    • The CAM path skips the FFN, reducing background/noise influence.
    • The original CLIP path is preserved to maintain embedding quality.
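
To make the q=k=v idea concrete, here is a rough sketch of what such a consistent attention step looks like. It is illustrative only: v_proj and scale stand in for the block's value projection and attention scale, and the actual implementation lives in clip_surgery_model.py.

import torch

def consistent_self_attention(x, v_proj, scale):
    # x: [B, N, C] token features entering a surgered block.
    # Architecture Surgery replaces the query and key with the value,
    # so attention becomes softmax(v @ v^T * scale) @ v.
    v = v_proj(x)                              # value projection, [B, N, C]
    attn = (v @ v.transpose(-2, -1)) * scale   # consistent (v-v) attention scores
    attn = attn.softmax(dim=-1)
    return attn @ v                            # aggregated values, [B, N, C]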

2) clip.encode_text_with_prompt_ensemble for Prompt Ensemble

  • Using only "a photo of a {}." is unstable, so embeddings from multiple templates are averaged and normalized → stable class embeddings (a sketch of the ensembling follows the template list).
  • Example templates (in clip.py):
    prompt_templates = ['a bad photo of a {}.', 'a photo of many {}.', 'a sculpture of a {}.', 
    'a photo of the hard to see {}.', 'a low resolution photo of the {}.', 'a rendering of a {}.',
    'graffiti of a {}.', 'a bad photo of the {}.', 'a cropped photo of the {}.', 'a tattoo of a {}.',
    'the embroidered {}.', 'a photo of a hard to see {}.', 'a bright photo of a {}.',
    'a photo of a clean {}.', 'a photo of a dirty {}.', 'a dark photo of the {}.',
    'a drawing of a {}.', 'a photo of my {}.', 'the plastic {}.', 'a photo of the cool {}.',
    'a close-up photo of a {}.', 'a black and white photo of the {}.', 'a painting of the {}.',
    'a painting of a {}.', 'a pixelated photo of the {}.', 'a sculpture of the {}.',
    'a bright photo of the {}.', 'a cropped photo of a {}.', 'a plastic {}.',
    'a photo of the dirty {}.', 'a jpeg corrupted photo of a {}.', 'a blurry photo of the {}.',
    'a photo of the {}.', 'a good photo of the {}.', 'a rendering of the {}.',
    'a {} in a video game.', 'a photo of one {}.', 'a doodle of a {}.', 
    'a close-up photo of the {}.', 'a photo of a {}.', 'the origami {}.', 
    'the {} in a video game.', 'a sketch of a {}.', 'a doodle of the {}.', 
    'a origami {}.', 'a low resolution photo of a {}.', 'the toy {}.', 
    'a rendition of the {}.', 'a photo of the clean {}.', 'a photo of a large {}.', 
    'a rendition of a {}.', 'a photo of a nice {}.', 'a photo of a weird {}.', 
    'a blurry photo of a {}.', 'a cartoon {}.', 'art of a {}.', 'a sketch of the {}.', 
    'a embroidered {}.', 'a pixelated photo of a {}.', 'itap of the {}.', 
    'a jpeg corrupted photo of the {}.', 'a good photo of a {}.', 'a plushie {}.', 
    'a photo of the nice {}.', 'a photo of the small {}.', 'a photo of the weird {}.', 
    'the cartoon {}.', 'art of the {}.', 'a drawing of the {}.', 'a photo of the large {}.', 
    'a black and white photo of a {}.', 'the plushie {}.', 'a dark photo of a {}.', 
    'itap of a {}.', 'graffiti of the {}.', 'a toy {}.', 'itap of my {}.', 
    'a photo of a cool {}.', 'a photo of a small {}.', 'a tattoo of the {}.', 
    'there is a {} in the scene.', 'there is the {} in the scene.', 
    'this is a {} in the scene.', 'this is the {} in the scene.', 'this is one {} in the scene.']
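
For intuition, the ensembling boils down to embedding every filled-in template, normalizing, averaging, and re-normalizing. Below is a rough sketch of that logic; the helper name ensemble_text_features is just for illustration, and the repo's encode_text_with_prompt_ensemble may differ in details.

import torch
import clip  # the clip module bundled with CLIP_Surgery

def ensemble_text_features(model, classnames, templates, device):
    # For each class: embed every template, L2-normalize, average the embeddings,
    # then re-normalize the mean to get one stable class embedding.
    class_embs = []
    with torch.no_grad():
        for name in classnames:
            prompts = [t.format(name) for t in templates]   # fill "{}" with the class name
            tokens = clip.tokenize(prompts).to(device)
            embs = model.encode_text(tokens)
            embs = embs / embs.norm(dim=-1, keepdim=True)
            mean_emb = embs.mean(dim=0)
            class_embs.append(mean_emb / mean_emb.norm())
    return torch.stack(class_embs)                           # [N, C]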
    

3) Feature Surgery

  • Compute the element-wise product of image tokens × text embeddings.
  • Estimate the class-common features (the mean across classes) → subtract them → focus on the foreground.
  • Summing over the channel dimension yields a similarity tensor of shape [B, 1+HW, N].
  • In clip.py (a quick shape check follows the function below):
def clip_feature_surgery(image_features, text_features, redundant_feats=None, t=2):

    if redundant_feats != None:
        similarity = image_features @ (text_features - redundant_feats).t()

    else:
        # weights to restrain influence of obvious classes on others
        prob = image_features[:, :1, :] @ text_features.t()
        prob = (prob * 2).softmax(-1)
        w = prob / prob.mean(-1, keepdim=True)

        # element-wise multiplied features
        b, n_t, n_i, c = image_features.shape[0], text_features.shape[0], image_features.shape[1], image_features.shape[2]
        feats = image_features.reshape(b, n_i, 1, c) * text_features.reshape(1, 1, n_t, c)
        feats *= w.reshape(1, 1, n_t, 1)
        redundant_feats = feats.mean(2, keepdim=True) # along cls dim
        feats = feats - redundant_feats
        
        # sum the element-wise multiplied features as cosine similarity
        similarity = feats.sum(-1)

    return similarity
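
As a quick sanity check, the function can be exercised with random tensors shaped like the demo's (1 image, 1 CLS + 14x14 patch tokens, 512-dim embeddings, 59 classes). This toy snippet is not part of the repo; it only confirms the [B, 1+HW, N] output shape and assumes the CLIP_Surgery clip module is importable.

import torch
import clip  # the CLIP_Surgery clip module

# dummy, already-normalized inputs with the demo's shapes
image_features = torch.randn(1, 197, 512)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = torch.randn(59, 512)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarity = clip.clip_feature_surgery(image_features, text_features)
print(similarity.shape)   # torch.Size([1, 197, 59]) -> [B, 1+HW, N]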

4) Building the Similarity Map

  • Exclude CLS (no spatial info), use only patch tokens (HW).
  • Normalize each channel with min-max, then upsample bilinearly to the original image size (a sketch of this step follows the list).
  • Select target class channel (dog) and overlay as a heatmap.
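
Conceptually, get_similarity_map does something like the following: per-channel min-max normalization over the patch tokens, a reshape back to the 14x14 patch grid, and bilinear upsampling to the image size. This is a simplified sketch of that behavior; the hypothetical helper similarity_to_map is not the repo's exact code.

import torch
import torch.nn.functional as F

def similarity_to_map(sim, out_hw):
    # sim: [B, HW, N] patch-token similarities (CLS already dropped)
    B, HW, N = sim.shape
    side = int(HW ** 0.5)                                   # 14 for ViT-B/16 at 224x224
    sim = sim - sim.min(dim=1, keepdim=True).values         # per-channel min-max over tokens
    sim = sim / (sim.max(dim=1, keepdim=True).values + 1e-6)
    sim = sim.permute(0, 2, 1).reshape(B, N, side, side)    # back to the patch grid: [B, N, h, w]
    sim = F.interpolate(sim, size=out_hw, mode='bilinear', align_corners=False)
    return sim.permute(0, 2, 3, 1)                          # [B, H, W, N], ready to overlay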

🧭 Limitations (Finding Failure Cases)

Still, part segmentation inside the object is not solved.
Image

In some cases it seems to work!?
Image

  • Results vary depending on the image and the target object.

✅ Summary

This practice showed that we can greatly improve CLIP's explainability without additional training.
Key ideas:

  • (i) Consistent Self-Attention + Dual Path reduces structural issues,
  • (ii) Feature Surgery removes redundant features, making the foreground stand out clearly.

🧬 (Korean) Practicing ClipSurgery!!

Today we will do a Python hands-on with ClipSurgery, the CLIP model whose internals have been surgically modified!!

We'll run it ourselves and walk through drawing a clean similarity map (CAM) on top of an image!!


✅ Environment Setup!!

  • First, clone the CLIP Surgery Git repo!!
    
    git clone https://github.com/xmed-lab/CLIP_Surgery.git
    
  • It contains clip.py and clip_surgery_model.py, and demo.ipynb as well~
  • We customized this demo.ipynb!!

  • Additionally:
  • Python ≥ 3.9; CUDA recommended (works on CPU without it)
  • Required libraries: torch, opencv-python, numpy, Pillow, matplotlib, torchvision
  • Provided by the CLIP_Surgery repo (or a module with the same functionality):
    • clip.load("CS-ViT-B/16", ...) (the surgically modified vision backbone)
    • encode_text_with_prompt_ensemble
    • clip_feature_surgery
    • get_similarity_map
  • One image file to visualize (e.g. dog.jpg)

🧠 What will we do?

  1. Extract image token features with the CS-ViT-B/16 model (with Architecture Surgery applied),
  2. Build stable class (text) embeddings via Prompt Ensemble,
  3. Remove class-common (redundant) features with Feature Surgery,
  4. Generate and visualize a foreground-focused similarity map.

🧪 Let's start right away!! - Full Code

import clip
import torch
import cv2
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt
from torchvision.transforms import Compose, Resize, ToTensor, Normalize
from torchvision.transforms import InterpolationMode
BICUBIC = InterpolationMode.BICUBIC
# from segment_anything import sam_model_registry, SamPredictor

# 0) Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) (Optional) Load original CLIP - for comparison
model, _ = clip.load("ViT-B/16", device=device)
model.eval()

# 2) Preprocessing pipeline
preprocess = Compose([
    Resize((224, 224), interpolation=BICUBIC),
    ToTensor(),
    Normalize((0.48145466, 0.4578275, 0.40821073),
              (0.26862954, 0.26130258, 0.27577711))
])

# 3) Load input image (set my_path to the directory that contains dog.jpg)
pil_img = Image.open(f"/{my_path}/dog.jpg")
cv2_img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2BGR)
image = preprocess(pil_img).unsqueeze(0).to(device)

# 4) Class dictionary - Feature Surgery must see multiple classes together to remove the common (redundant) component
all_texts = [
    'airplane', 'bag', 'bed', 'bedclothes', 'bench', 'bicycle', 'bird', 'boat',
    'book', 'bottle', 'building', 'bus', 'cabinet', 'car', 'cat', 'ceiling',
    'chair', 'cloth', 'computer', 'cow', 'cup', 'curtain', 'dog', 'door',
    'fence', 'floor', 'flower', 'food', 'grass', 'ground', 'horse', 'keyboard',
    'light', 'motorbike', 'mountain', 'mouse', 'person', 'plate', 'platform',
    'potted plant', 'road', 'rock', 'sheep', 'shelves', 'sidewalk', 'sign',
    'sky', 'snow', 'sofa', 'table', 'track', 'train', 'tree', 'truck',
    'tv monitor', 'wall', 'water', 'window', 'wood'
]
target_texts = ['dog']

# 5) Load the surgically modified architecture (CS-ViT-B/16)
model, preprocess_unused = clip.load("CS-ViT-B/16", device=device)
model.eval()

with torch.no_grad():
    # (A) Image features (per token) - includes CLS
    image_features = model.encode_image(image)                    # [B, 1+HW, C]
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # (B) Prompt-ensemble-based text features - stabilized per-class embeddings
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)  # [N, C]

    # (C) Feature Surgery - similarity with the class-common (redundant) component removed
    similarity = clip.clip_feature_surgery(image_features, text_features)            # [B, 1+HW, N]

    # (D) Build the similarity map from patch tokens only + upsample to the original size
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])  # [B, H, W, N]

    # (E) Visualization - overlay the target class only
    for b in range(similarity_map.shape[0]):
        for n in range(similarity_map.shape[-1]):
            if all_texts[n] not in target_texts:
                continue
            vis = (similarity_map[b, :, :, n].cpu().numpy() * 255).astype('uint8')
            vis = cv2.applyColorMap(vis, cv2.COLORMAP_JET)
            vis = cv2_img * 0.4 + vis * 0.6
            vis = cv2.cvtColor(vis.astype('uint8'), cv2.COLOR_BGR2RGB)
            print('CLIP Surgery:', all_texts[n])
            plt.imshow(vis)
            plt.axis('off')
            plt.show()

Run the code above and it immediately segments the dog quite well!!

Image


🔍 Code Explanation (Key Points)

1) CS-ViT-B/16: the backbone with Architecture Surgery applied - the changes live in clip_surgery_model.py! See the paper-reading post for a detailed analysis!!

  • In several of the final blocks, q=k=v(=V) is used to perform Consistent Self-Attention,
  • and a Dual Path is added: the CAM path skips the FFN to reduce the influence of background/noise,
  • while the original (CLIP) path is kept as-is to preserve the original embedding quality.

2) The Prompt Ensemble performed inside clip.encode_text_with_prompt_ensemble

  • "a photo of a {}." ํ•˜๋‚˜๋กœ๋งŒ ํ•˜๋ฉด ๋ฐ˜์‘์ด ๋‹ค๋ฅผ์ˆ˜ ์žˆ์œผ๋‹ˆ ๊ทธ ์™ธ์—๋„ ๋‹ค์–‘ํ•œ ํ…œํ”Œ๋ฆฟ์œผ๋กœ ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ์„ ํ‰๊ท /์ •๊ทœํ™” โ†’ ํด๋ž˜์Šค ์ž„๋ฒ ๋”ฉ ์•ˆ์ •ํ™”.
  • ๋‹ค์–‘ํ•œ ํ”„๋กฌํฌํŠธ๋Š”?? ์•„๋ž˜์™€ ๊ฐ™์•˜์Šต๋‹ˆ๋‹ค!!(clip.py์— ์žˆ์Œ!!)
    ```python prompt_templates = [โ€˜a bad photo of a {}.โ€™, โ€˜a photo of many {}.โ€™, โ€˜a sculpture of a {}.โ€™, โ€˜a photo of the hard to see {}.โ€™, โ€˜a low resolution photo of the {}.โ€™, โ€˜a rendering of a {}.โ€™, โ€˜graffiti of a {}.โ€™, โ€˜a bad photo of the {}.โ€™, โ€˜a cropped photo of the {}.โ€™, โ€˜a tattoo of a {}.โ€™, โ€˜the embroidered {}.โ€™, โ€˜a photo of a hard to see {}.โ€™, โ€˜a bright photo of a {}.โ€™, โ€˜a photo of a clean {}.โ€™, โ€˜a photo of a dirty {}.โ€™, โ€˜a dark photo of the {}.โ€™, โ€˜a drawing of a {}.โ€™, โ€˜a photo of my {}.โ€™, โ€˜the plastic {}.โ€™, โ€˜a photo of the cool {}.โ€™, โ€˜a close-up photo of a {}.โ€™, โ€˜a black and white photo of the {}.โ€™, โ€˜a painting of the {}.โ€™, โ€˜a painting of a {}.โ€™, โ€˜a pixelated photo of the {}.โ€™, โ€˜a sculpture of the {}.โ€™, โ€˜a bright photo of the {}.โ€™, โ€˜a cropped photo of a {}.โ€™, โ€˜a plastic {}.โ€™, โ€˜a photo of the dirty {}.โ€™, โ€˜a jpeg corrupted photo of a {}.โ€™, โ€˜a blurry photo of the {}.โ€™, โ€˜a photo of the {}.โ€™, โ€˜a good photo of the {}.โ€™, โ€˜a rendering of the {}.โ€™, โ€˜a {} in a video game.โ€™, โ€˜a photo of one {}.โ€™, โ€˜a doodle of a {}.โ€™, โ€˜a close-up photo of the {}.โ€™, โ€˜a photo of a {}.โ€™, โ€˜the origami {}.โ€™, โ€˜the {} in a video game.โ€™, โ€˜a sketch of a {}.โ€™, โ€˜a doodle of the {}.โ€™, โ€˜a origami {}.โ€™, โ€˜a low resolution photo of a {}.โ€™, โ€˜the toy {}.โ€™, โ€˜a rendition of the {}.โ€™, โ€˜a photo of the clean {}.โ€™, โ€˜a photo of a large {}.โ€™, โ€˜a rendition of a {}.โ€™, โ€˜a photo of a nice {}.โ€™, โ€˜a photo of a weird {}.โ€™, โ€˜a blurry photo of a {}.โ€™, โ€˜a cartoon {}.โ€™, โ€˜art of a {}.โ€™, โ€˜a sketch of the {}.โ€™, โ€˜a embroidered {}.โ€™, โ€˜a pixelated photo of a {}.โ€™, โ€˜itap of the {}.โ€™, โ€˜a jpeg corrupted photo of the {}.โ€™, โ€˜a good photo of a {}.โ€™, โ€˜a plushie {}.โ€™, โ€˜a photo of the nice {}.โ€™, โ€˜a photo of the small {}.โ€™, โ€˜a photo of the weird {}.โ€™, โ€˜the cartoon {}.โ€™, โ€˜art of the {}.โ€™, โ€˜a drawing of the {}.โ€™, โ€˜a photo of the large {}.โ€™, โ€˜a black and white photo of a {}.โ€™, โ€˜the plushie {}.โ€™, โ€˜a dark photo of a {}.โ€™, โ€˜itap of a {}.โ€™, โ€˜graffiti of the {}.โ€™, โ€˜a toy {}.โ€™, โ€˜itap of my {}.โ€™, โ€˜a photo of a cool {}.โ€™, โ€˜a photo of a small {}.โ€™, โ€˜a tattoo of the {}.โ€™, โ€˜there is a {} in the scene.โ€™, โ€˜there is the {} in the scene.โ€™, โ€˜this is a {} in the scene.โ€™, โ€˜this is the {} in the scene.โ€™, โ€˜this is one {} in the scene.โ€™]
3) Feature Surgery

  • Build each class's features as the element-wise product of image tokens × text embeddings.
  • Estimate the class-common (redundant) component as the mean along the class axis and subtract it → focus on the foreground.
  • Summing over channels then yields the [B, 1+HW, N] similarity tensor.
  • The code below in clip.py does exactly that!!

def clip_feature_surgery(image_features, text_features, redundant_feats=None, t=2):

    if redundant_feats != None:
        similarity = image_features @ (text_features - redundant_feats).t()

    else:
        # weights to restrain influence of obvious classes on others
        prob = image_features[:, :1, :] @ text_features.t()
        prob = (prob * 2).softmax(-1)
        w = prob / prob.mean(-1, keepdim=True)

        # element-wise multiplied features
        b, n_t, n_i, c = image_features.shape[0], text_features.shape[0], image_features.shape[1], image_features.shape[2]
        feats = image_features.reshape(b, n_i, 1, c) * text_features.reshape(1, 1, n_t, c)
        feats *= w.reshape(1, 1, n_t, 1)
        redundant_feats = feats.mean(2, keepdim=True) # along cls dim
        feats = feats - redundant_feats
        
        # sum the element-wise multiplied features as cosine similarity
        similarity = feats.sum(-1)

    return similarity

4) Building the Similarity Map

  • Exclude CLS (it has no positional information) and use only the patch tokens (HW).
  • Min-max normalize each channel, then upsample bilinearly to the original image size.
  • Select only the target class (dog) channel and overlay it as a heatmap.

🧭 However!! (Finding Failure Cases)

Part segmentation inside the object still does not work!! Image

In some cases it does seem to work!? Image

  • So the result seems to depend on the kind of image and the target object~~

✅ Summary

Through this hands-on we confirmed that CLIP's explainability can be improved substantially without any additional training.
The key ideas are (i) reducing the structural problems with Consistent Self-Attention + a Dual Path, and (ii) removing redundant features with Feature Surgery so that the foreground stands out clearly.

This post is licensed under CC BY 4.0 by the author.