ClipSurgery Hands-on: Practicing CLIP Surgery!!
Today we will practice with CLIP Surgery, the CLIP model whose internal structure has been surgically modified!
We'll run it directly and generate a clean similarity map (CAM) on top of an image!!
Environment Setup!!
- First, clone the CLIP Surgery Git repo:

```bash
git clone https://github.com/xmed-lab/CLIP_Surgery.git
```
- This includes `clip.py`, `clip_surgery_model.py`, and also `demo.ipynb`. We will customize this `demo.ipynb`.
- Additionally:
- Python ≥ 3.9; CUDA recommended (but CPU works too)
- Required libraries: `torch`, `opencv-python`, `numpy`, `Pillow`, `matplotlib`, `torchvision`
- From the CLIP_Surgery repo (or an equivalent module) we need:
  - `clip.load("CS-ViT-B/16", ...)` (the surgically modified vision backbone)
  - `encode_text_with_prompt_ensemble`
  - `clip_feature_surgery`
  - `get_similarity_map`
- One image file for visualization (e.g. `dog.jpg`)
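Before customizing `demo.ipynb`, it may help to confirm the environment is wired up. The snippet below is a minimal check, assuming it is run from inside the cloned `CLIP_Surgery` directory so that `import clip` resolves to the repo's `clip.py`:

```python
# Minimal environment check (run inside the cloned CLIP_Surgery directory,
# so that `clip` is the repo's modified clip.py rather than the pip package).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("CS-ViT-B/16", device=device)  # surgically modified backbone
print(type(model).__name__, "loaded on", device)
```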
What will we do?
- Extract image token features with the CS-ViT-B/16 model (with Architecture Surgery),
- Build stable class (text) embeddings via Prompt Ensemble,
- Apply Feature Surgery to remove redundant/common class features,
- Generate and visualize a foreground-focused similarity map.
Let's Start!! - Full Code
```python
import clip
import torch
import cv2
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt
from torchvision.transforms import Compose, Resize, ToTensor, Normalize
from torchvision.transforms import InterpolationMode

BICUBIC = InterpolationMode.BICUBIC
# from segment_anything import sam_model_registry, SamPredictor

# 0) Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) (Optional) Load original CLIP -- for comparison
model, _ = clip.load("ViT-B/16", device=device)
model.eval()

# 2) Preprocessing pipeline
preprocess = Compose([
    Resize((224, 224), interpolation=BICUBIC),
    ToTensor(),
    Normalize((0.48145466, 0.4578275, 0.40821073),
              (0.26862954, 0.26130258, 0.27577711))
])

# 3) Load input image (my_path: the directory holding your test image)
pil_img = Image.open(f"/{my_path}/dog.jpg")
cv2_img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2BGR)
image = preprocess(pil_img).unsqueeze(0).to(device)

# 4) Class dictionary -- Feature Surgery requires multiple classes
all_texts = [
    'airplane', 'bag', 'bed', 'bedclothes', 'bench', 'bicycle', 'bird', 'boat',
    'book', 'bottle', 'building', 'bus', 'cabinet', 'car', 'cat', 'ceiling',
    'chair', 'cloth', 'computer', 'cow', 'cup', 'curtain', 'dog', 'door',
    'fence', 'floor', 'flower', 'food', 'grass', 'ground', 'horse', 'keyboard',
    'light', 'motorbike', 'mountain', 'mouse', 'person', 'plate', 'platform',
    'potted plant', 'road', 'rock', 'sheep', 'shelves', 'sidewalk', 'sign',
    'sky', 'snow', 'sofa', 'table', 'track', 'train', 'tree', 'truck',
    'tv monitor', 'wall', 'water', 'window', 'wood'
]
target_texts = ['dog']

# 5) Load the surgically modified architecture (CS-ViT-B/16)
model, preprocess_unused = clip.load("CS-ViT-B/16", device=device)
model.eval()

with torch.no_grad():
    # (A) Image features (per token, including CLS)
    image_features = model.encode_image(image)  # [B, 1+HW, C]
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # (B) Prompt-ensemble-based text features
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)  # [N, C]

    # (C) Feature Surgery -- remove redundant (class-common) features
    similarity = clip.clip_feature_surgery(image_features, text_features)  # [B, 1+HW, N]

    # (D) Similarity map from patch tokens only + upsample to the image size
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])  # [B, H, W, N]

    # (E) Visualization -- overlay the target class only
    for b in range(similarity_map.shape[0]):
        for n in range(similarity_map.shape[-1]):
            if all_texts[n] not in target_texts:
                continue
            vis = (similarity_map[b, :, :, n].cpu().numpy() * 255).astype('uint8')
            vis = cv2.applyColorMap(vis, cv2.COLORMAP_JET)
            vis = cv2_img * 0.4 + vis * 0.6
            vis = cv2.cvtColor(vis.astype('uint8'), cv2.COLOR_BGR2RGB)
            print('CLIP Surgery:', all_texts[n])
            plt.imshow(vis)
            plt.axis('off')
            plt.show()
```
Running this code immediately segments the dog quite well!
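If you want a hard mask rather than a heatmap, a simple threshold on the normalized similarity map also works. This continues from the variables in the code above; the 0.5 threshold is only an illustrative choice, not something from the repo:

```python
# Illustrative post-processing (not part of the repo's demo): threshold the
# normalized similarity map of the 'dog' channel to get a binary mask.
dog_idx = all_texts.index('dog')
heat = similarity_map[0, :, :, dog_idx].cpu().numpy()   # [H, W], values in [0, 1]
mask = (heat > 0.5).astype('uint8') * 255                # arbitrary 0.5 threshold
masked = cv2.bitwise_and(cv2_img, cv2_img, mask=mask)
plt.imshow(cv2.cvtColor(masked, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()
```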
Code Explanation (Key Points)
1) CS-ViT-B/16: Backbone with Architecture Surgery (implemented in `clip_surgery_model.py`; see the earlier paper-review post for the detailed analysis)
- In the last several blocks, the attention is changed to q = k = v (everything uses the value projection), giving Consistent Self-Attention.
- A Dual Path is introduced:
  - The CAM path skips the FFN, reducing background/noise influence.
  - The original CLIP path is preserved to maintain embedding quality.

A toy sketch of this block structure follows this list.
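The actual modification lives in `clip_surgery_model.py`; the block below is only a simplified sketch of the idea (all class and attribute names are hypothetical, not the repo's code): the original path runs standard q-k attention plus the FFN, while the CAM path adds v-v (consistent) attention computed on the same normalized input and skips the FFN.

```python
# Toy sketch of a dual-path block (hypothetical names, NOT the repo's implementation):
# original CLIP path = q-k attention + FFN; CAM path = v-v attention, no FFN.
import torch
import torch.nn as nn

class DualPathBlockSketch(nn.Module):
    def __init__(self, dim, num_heads=1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(dim)
        self.ln_2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v_proj = nn.Linear(dim, dim)      # stands in for the value projection
        self.out_proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.scale = dim ** -0.5

    def forward(self, x_cam, x_ori):
        x_ln = self.ln_1(x_ori)
        # original path: standard q-k self-attention, then FFN
        attn_out, _ = self.attn(x_ln, x_ln, x_ln)
        x_ori = x_ori + attn_out
        x_ori = x_ori + self.mlp(self.ln_2(x_ori))
        # CAM path: attention map from value-value similarity (q = k = v), FFN skipped
        v = self.v_proj(x_ln)
        vv = ((v @ v.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        x_cam = x_cam + self.out_proj(vv @ v)
        return x_cam, x_ori

# shape check: 1 + 14*14 = 197 tokens for ViT-B/16 at 224x224
blk = DualPathBlockSketch(dim=64)
tokens = torch.randn(1, 197, 64)
cam, ori = blk(tokens.clone(), tokens)
print(cam.shape, ori.shape)   # torch.Size([1, 197, 64]) for both paths
```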
2) Prompt Ensemble via `clip.encode_text_with_prompt_ensemble`
- Using only `"a photo of a {}."` is unstable, so the text embeddings of multiple templates are averaged and normalized, giving stable class embeddings.
- Example templates (in `clip.py`):

```python
prompt_templates = ['a bad photo of a {}.', 'a photo of many {}.', 'a sculpture of a {}.', 'a photo of the hard to see {}.', 'a low resolution photo of the {}.', 'a rendering of a {}.', 'graffiti of a {}.', 'a bad photo of the {}.', 'a cropped photo of the {}.', 'a tattoo of a {}.', 'the embroidered {}.', 'a photo of a hard to see {}.', 'a bright photo of a {}.', 'a photo of a clean {}.', 'a photo of a dirty {}.', 'a dark photo of the {}.', 'a drawing of a {}.', 'a photo of my {}.', 'the plastic {}.', 'a photo of the cool {}.', 'a close-up photo of a {}.', 'a black and white photo of the {}.', 'a painting of the {}.', 'a painting of a {}.', 'a pixelated photo of the {}.', 'a sculpture of the {}.', 'a bright photo of the {}.', 'a cropped photo of a {}.', 'a plastic {}.', 'a photo of the dirty {}.', 'a jpeg corrupted photo of a {}.', 'a blurry photo of the {}.', 'a photo of the {}.', 'a good photo of the {}.', 'a rendering of the {}.', 'a {} in a video game.', 'a photo of one {}.', 'a doodle of a {}.', 'a close-up photo of the {}.', 'a photo of a {}.', 'the origami {}.', 'the {} in a video game.', 'a sketch of a {}.', 'a doodle of the {}.', 'a origami {}.', 'a low resolution photo of a {}.', 'the toy {}.', 'a rendition of the {}.', 'a photo of the clean {}.', 'a photo of a large {}.', 'a rendition of a {}.', 'a photo of a nice {}.', 'a photo of a weird {}.', 'a blurry photo of a {}.', 'a cartoon {}.', 'art of a {}.', 'a sketch of the {}.', 'a embroidered {}.', 'a pixelated photo of a {}.', 'itap of the {}.', 'a jpeg corrupted photo of the {}.', 'a good photo of a {}.', 'a plushie {}.', 'a photo of the nice {}.', 'a photo of the small {}.', 'a photo of the weird {}.', 'the cartoon {}.', 'art of the {}.', 'a drawing of the {}.', 'a photo of the large {}.', 'a black and white photo of a {}.', 'the plushie {}.', 'a dark photo of a {}.', 'itap of a {}.', 'graffiti of the {}.', 'a toy {}.', 'itap of my {}.', 'a photo of a cool {}.', 'a photo of a small {}.', 'a tattoo of the {}.', 'there is a {} in the scene.', 'there is the {} in the scene.', 'this is a {} in the scene.', 'this is the {} in the scene.', 'this is one {} in the scene.']
```
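Functionally, the helper does roughly the following: encode every template for each class, normalize, average, and re-normalize. The sketch below (the function name is ours, assuming the OpenAI-style `clip.tokenize` / `model.encode_text` API) illustrates the procedure:

```python
# Rough sketch of prompt ensembling (illustrative; see the repo's
# encode_text_with_prompt_ensemble in clip.py for the real implementation).
import torch
import clip  # the repo's clip.py (OpenAI-style API)

def prompt_ensemble_sketch(model, classnames, templates, device):
    class_embeddings = []
    with torch.no_grad():
        for name in classnames:
            tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
            feats = model.encode_text(tokens)                       # [T, C]
            feats = feats / feats.norm(dim=-1, keepdim=True)        # per-template normalization
            mean_feat = feats.mean(dim=0)                           # average over templates
            class_embeddings.append(mean_feat / mean_feat.norm())   # re-normalize the mean
    return torch.stack(class_embeddings)                            # [N, C]
```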
3) Feature Surgery
- Compute the element-wise product of image tokens × text embeddings to get per-class features.
- Estimate the class-common (redundant) features as the mean across classes and subtract them, which focuses the result on the foreground.
- Summing over the channel dimension yields the similarity tensor [B, 1+HW, N].
- In `clip.py`:
```python
def clip_feature_surgery(image_features, text_features, redundant_feats=None, t=2):

    if redundant_feats != None:
        similarity = image_features @ (text_features - redundant_feats).t()

    else:
        # weights to restrain influence of obvious classes on others
        prob = image_features[:, :1, :] @ text_features.t()
        prob = (prob * 2).softmax(-1)
        w = prob / prob.mean(-1, keepdim=True)

        # element-wise multiplied features
        b, n_t, n_i, c = image_features.shape[0], text_features.shape[0], image_features.shape[1], image_features.shape[2]
        feats = image_features.reshape(b, n_i, 1, c) * text_features.reshape(1, 1, n_t, c)
        feats *= w.reshape(1, 1, n_t, 1)
        redundant_feats = feats.mean(2, keepdim=True)  # along cls dim
        feats = feats - redundant_feats

        # sum the element-wise multiplied features as cosine similarity
        similarity = feats.sum(-1)

    return similarity
```
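As a quick sanity check of the shapes involved, the snippet below feeds dummy normalized tensors through the function above (in the demo this is called as `clip.clip_feature_surgery`):

```python
# Dummy-tensor shape check for clip_feature_surgery (illustrative values).
import torch

B, HW, C, N = 1, 196, 512, 59   # 14x14 patch grid for ViT-B/16 at 224x224, 59 classes
image_features = torch.randn(B, 1 + HW, C)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = torch.randn(N, C)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarity = clip_feature_surgery(image_features, text_features)
print(similarity.shape)          # torch.Size([1, 197, 59]) == [B, 1+HW, N]
```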
4) Building the Similarity Map
- Exclude the CLS token (it carries no spatial information) and use only the patch tokens (HW).
- Min-max normalize each class channel, then upsample bilinearly to the original image size.
- Select the target class channel (`dog`) and overlay it as a heatmap.

A rough sketch of these steps is shown below.
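`get_similarity_map` lives in `clip.py`; the function below is our own simplified sketch (assuming a square patch grid), mirroring the steps just described: per-channel min-max normalization, reshape to the patch grid, and bilinear upsampling.

```python
# Simplified sketch of get_similarity_map (see clip.py for the repo's version).
import torch
import torch.nn.functional as F

def get_similarity_map_sketch(sm, shape):
    # sm: [B, HW, N] patch-token similarities; shape: (H_img, W_img)
    sm = (sm - sm.min(1, keepdim=True)[0]) / (sm.max(1, keepdim=True)[0] - sm.min(1, keepdim=True)[0])
    side = int(sm.shape[1] ** 0.5)                       # assume a square patch grid
    sm = sm.reshape(sm.shape[0], side, side, -1).permute(0, 3, 1, 2)
    sm = F.interpolate(sm, shape, mode='bilinear')       # upsample to the image size
    return sm.permute(0, 2, 3, 1)                        # [B, H, W, N]

print(get_similarity_map_sketch(torch.rand(1, 196, 3), (480, 640)).shape)  # torch.Size([1, 480, 640, 3])
```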
Limitations (Finding Failure Cases)
- It still does not produce part-level segmentation inside an object (e.g., the dog's parts are not separated).
- Results vary depending on the image and the target object.
Summary
This practice showed that we can greatly improve CLIP's explainability without any additional training.
Key ideas:
- (i) Consistent Self-Attention + the Dual Path reduce the structural issues,
- (ii) Feature Surgery removes redundant (class-common) features, making the foreground stand out clearly.