CLIP Surgery: A Closer Look at the Explainability of Contrastive Language-Image Pre-training
CLIP Surgery: Enhancing Explainability by Operating on CLIP!
- Title: A Closer Look at the Explainability of Contrastive Language-Image Pre-training (CLIP Surgery)
- Journal: Pattern Recognition (2025)
- Code: GitHub – CLIP Surgery
- Keywords: CLIP, Explainability, CAM, Vision-Language, Open-Vocabulary
- Summary: CLIP is a powerful vision-language model, but it often focuses on the background instead of the foreground or suffers from noisy activation. To address this, the authors propose Architecture Surgery and Feature Surgery, significantly improving explainability!
Key Summary of CLIP Surgery
One-liner: "Without extra training, structural and feature surgeries alone enhance CLIP's explainability!"
1) Proving CLIP's issues: identified inconsistency in self-attention and redundant features.
2) Feature Surgery: removes redundant features and suppresses unnecessary noisy activation → clean CAMs!!
3) Training-free explainability: no fine-tuning required; explainability is obtained from the original CLIP → versatile applications!!
Research Background
- CAM, Grad-CAM: effective for CNNs/ViTs but fail on CLIP.
  On CLIP they are noisy and produce opposite visualization (the background is highlighted instead of the object); in other words, localization fails!!
- Why do they fail on CLIP?
  - Because self-attention links inconsistent semantic regions, and redundant features emphasize the background rather than the foreground.
  a. Why does self-attention link inconsistent semantic regions?
    a-1. CLIP was trained only for global image–text matching, so attention never needed to focus precisely on object interiors.
    a-2. CLIP's Query, Key, and Value projections use different (heterologous) parameters, so Q/K relations connect inconsistent semantic areas:
      A_raw = softmax(s · QKᵀ)V (heterologous parameters, s: scaling factor)
      A_con = softmax(s · VVᵀ)V (homogeneous parameters)
  b. Why do redundant features cause noise?
    b-1. CLIP trains on many categories at once, so shared features (e.g., "sky", "grass", "road") appear frequently.
    b-2. These generic features mostly lie in the background, so self-attention is easily pulled toward them, causing noisy activation.
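To make the contrast concrete, here is a minimal toy sketch (my own illustration, not the authors' code) computing both attention variants on random single-head tensors; `s` is the usual attention scale.

```python
import torch

# Toy illustration (not the authors' code): raw Q-K attention vs. the
# "consistent" V-V attention on random single-head tensors.
torch.manual_seed(0)
N, C = 6, 8                              # tokens, channels
q, k, v = torch.randn(3, N, C).unbind(0)
s = C ** -0.5                            # attention scale

# i-1) raw self-attention: A_raw = softmax(s * Q K^T) V  (heterologous Q, K)
attn_raw = (s * q @ k.transpose(-2, -1)).softmax(dim=-1)
out_raw = attn_raw @ v

# i-2) consistent self-attention: A_con = softmax(s * V V^T) V  (homogeneous V)
sim = s * v @ v.transpose(-2, -1)        # symmetric similarity between value tokens
attn_con = sim.softmax(dim=-1)
out_con = attn_con @ v

print(torch.allclose(sim, sim.transpose(-2, -1)))  # True: tokens attend to value-similar tokens
print(out_raw.shape, out_con.shape)                # torch.Size([6, 8]) twice
```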
- Alignment-based approaches exist but require extra models, layers, or fine-tuning (not training-free).
- ECLIP: Realigns CLIP features with segmentation masks using self-supervision.
- RCLIP: Uses bounding box annotations to refine CLIP's image–text features per object.
- Both require retraining (fine-tuning).
CLIP Surgery Architecture
i) Architecture Surgery (Fixing Structural Issues)
- Raw self-attention (i-1) connects inconsistent semantic regions.
- Consistent self-attention (i-2) prevents unnecessary background emphasis.
  i-1) raw self-attention: A_raw = softmax(s · QKᵀ)V
  i-2) consistent self-attention: A_con = softmax(s · VVᵀ)V
- mFSR measures how much self-attention focuses on the foreground (object).
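As a rough sketch only, assuming mFSR is essentially the mean share of attention mass that lands on foreground tokens (my assumption based on the sentence above, not the paper's exact formula):

```python
import torch

def mean_foreground_attention_share(attn, fg_mask):
    """Hypothetical mFSR-style score (assumed definition, not the paper's formula).

    attn:    (N, N) attention weights, each row sums to 1
    fg_mask: (N,)  boolean, True for tokens inside the object (foreground)
    Returns the average fraction of attention mass that lands on foreground tokens.
    """
    fg_share = attn[:, fg_mask].sum(dim=-1)   # per-query mass on foreground tokens
    return fg_share.mean()

# toy usage: a higher score means attention is more focused on the object
attn = torch.rand(6, 6).softmax(dim=-1)
fg_mask = torch.tensor([True, True, False, False, False, False])
print(mean_foreground_attention_share(attn, fg_mask))
```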
- Code example: the Transformer `Attention` module's forward pass:
# i-1) Raw self-attention
def forward(self, x):
    B, N, C = x.shape
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]

    # original (raw) self-attention for the original path
    attn_ori = (q @ k.transpose(-2, -1)) * self.scale
    attn_ori = attn_ori.softmax(dim=-1)
    attn_ori = self.attn_drop(attn_ori)

    # i-2) consistent self-attention: replace q & k by v
    k = v
    q = k
    attn = (q @ k.transpose(-2, -1)) * self.scale
    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    # return both paths: x (new path, for CAM) and x_ori (original path, for embeddings)
    x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
    x = (attn @ v).transpose(1, 2).reshape(B, N, C)  # clip_surgery
    x = self.proj_drop(self.proj(x))
    x_ori = self.proj_drop(self.proj(x_ori))
    return [x, x_ori]
- Dual paths: one path preserves the CLIP embeddings, the other generates the CAM; the FFN is skipped in the new path to avoid its negative effect.
- In the paper's figure, the FFN inside the CLIP Transformer focuses on the wrong regions, and the early self-attention blocks are also inaccurate.
- Thus, in the new path only (consistent) self-attention is applied (FFN skipped), and only for the deeper blocks; the original path is kept to preserve the embeddings.
- Code example: the `ResidualAttentionBlock` forward:
def forward(self, x):
    # dual paths for blocks deeper than "d"
    if isinstance(self.attn, Attention):
        if isinstance(x, list):
            x, x_ori = x  # x_ori = original path, x = new path
            x_res = self.attention(self.ln_1(x_ori))
            x_res, x_ori_res = x_res  # consistent vs. raw self-attention outputs
            x_ori += x_ori_res
            x_ori = x_ori + self.mlp(self.ln_2(x_ori))  # original path adds the FFN
            x += x_res  # new path only adds consistent self-attention (no FFN)
            return [x, x_ori]
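For context, a minimal sketch of how a Transformer stack could switch to the dual paths only for the last `d` blocks (my own illustration; the repository's actual wrapper differs in details such as how the list is initialized):

```python
import torch.nn as nn

class DualPathStack(nn.Module):
    """Illustrative only: run shallow blocks normally, then keep [x, x_ori]
    through the last `d` (modified) ResidualAttentionBlocks."""

    def __init__(self, blocks, d):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.d = d                      # number of deepest blocks using dual paths

    def forward(self, x):
        n = len(self.blocks)
        for i, blk in enumerate(self.blocks):
            if i == n - self.d:
                x = [x, x]              # start dual paths: [new path (CAM), original path]
            x = blk(x)
        return x                        # [x, x_ori] after the final block
```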
ii) Feature Surgery (Fixing Representational Issues)
- CLIP learns many categories at once, leading to redundant features shared across classes.
- Example: when the target is "dog", the class-wise features of "cat", "sky", "sea", "airplane", etc. still overlap with it.
  A small L1 distance between the positive (target) class and the empty (irrelevant) classes means they overlap → the redundancy problem!! (A toy check of this appears after the implementation below.)
- Code: clip_feature_surgery estimates redundant_feats and subtracts it from the features of all classes.
# in demo.py, define all_texts (candidate classes, including "empty" ones) and target_texts
all_texts = [...]
target_texts = ['dog']

with torch.no_grad():
    # image token features, L2-normalized
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    # prompt-ensembled text features
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)
    # feature surgery, then a similarity map from the image tokens (CLS token dropped)
    similarity = clip.clip_feature_surgery(image_features, text_features)
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])

    for b in range(similarity_map.shape[0]):
        for n in range(similarity_map.shape[-1]):
            if all_texts[n] not in target_texts:
                continue
            vis = (similarity_map[b, :, :, n].cpu().numpy() * 255).astype('uint8')
            vis = cv2.applyColorMap(vis, cv2.COLORMAP_JET)
            vis = cv2_img * 0.4 + vis * 0.6
            vis = cv2.cvtColor(vis.astype('uint8'), cv2.COLOR_BGR2RGB)
            print('CLIP:', all_texts[n])
            plt.imshow(vis)
            plt.show()
## clip.py: clip_feature_surgery estimates redundant_feats and subtracts it from every class-wise feature
def clip_feature_surgery(image_features, text_features, redundant_feats=None, t=2):
    if redundant_feats is not None:
        # redundant features already given: remove them from the text side and compare
        similarity = image_features @ (text_features - redundant_feats).t()
    else:
        # weights to restrain the influence of obvious classes on the others
        prob = image_features[:, :1, :] @ text_features.t()
        prob = (prob * 2).softmax(-1)
        w = prob / prob.mean(-1, keepdim=True)

        # element-wise multiplied image/text features, per class
        b, n_t, n_i, c = image_features.shape[0], text_features.shape[0], image_features.shape[1], image_features.shape[2]
        feats = image_features.reshape(b, n_i, 1, c) * text_features.reshape(1, 1, n_t, c)
        feats *= w.reshape(1, 1, n_t, 1)

        # mean over the class dimension = the redundant (shared) component
        redundant_feats = feats.mean(2, keepdim=True)
        feats = feats - redundant_feats

        # sum of the element-wise products = cosine similarity (features are normalized)
        similarity = feats.sum(-1)

    return similarity
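To see why subtracting the class-wise mean helps, here is a small synthetic check (my own illustration, not from the repository): class features are built as one large shared component plus small class-specific parts, so their pairwise L1 distances are tiny relative to their magnitude; removing the mean over classes strips the shared part and leaves nearly orthogonal, class-specific features.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, c = 8, 16

# synthetic class-wise features: large shared ("background") component + small specific parts
shared = torch.randn(c)
specific = 0.3 * torch.randn(n_classes, c)
feats = shared.unsqueeze(0) + specific                     # (n_classes, c)

# classes look almost identical: pairwise L1 distance is small relative to feature magnitude
pair_l1 = (feats.unsqueeze(0) - feats.unsqueeze(1)).abs().mean()
print('pairwise L1 / mean |feature|:', (pair_l1 / feats.abs().mean()).item())  # << 1

# toy "feature surgery": subtract the mean over the class dimension (the redundant part)
cleaned = feats - feats.mean(0, keepdim=True)

off_diag = ~torch.eye(n_classes, dtype=torch.bool)
cos_before = F.cosine_similarity(feats.unsqueeze(0), feats.unsqueeze(1), dim=-1)[off_diag].mean()
cos_after = F.cosine_similarity(cleaned.unsqueeze(0), cleaned.unsqueeze(1), dim=-1)[off_diag].mean()
print('mean pairwise cosine before surgery:', cos_before.item())  # close to 1 (redundant)
print('mean pairwise cosine after surgery :', cos_after.item())   # near 0 (class-specific)
```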
Experimental Results
Explainability Benchmarks
- On VOC 2012, COCO, and PascalContext:
  - mIoU: +22–36% improvement
  - mSC: +48–65% improvement
Open-Vocabulary Tasks
- Semantic Segmentation: best training-free method (PascalContext mIoU 29.3%)
- Multi-label Recognition: +11.61% mAP over CLIP on NUS-WIDE
- Interactive Segmentation: replaces manual labels by converting text into point prompts for SAM (see the sketch after this list)
- Multimodal Visualization: interprets CLIP's training itself
  - The "[end]" token is the text token most often activated, and non-object words like "in", ".", "of" are also highly active!!
  - This suggests redundant tokens in CLIP's vocabulary and provides ideas for improving CLIP training in the future.
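A rough sketch of the text-to-points idea (my own illustration under assumed thresholds, not the paper's exact procedure): take high-similarity locations as positive points and low-similarity locations as negative points, then hand them to a SAM-style predictor as point coordinates and labels.

```python
import numpy as np

def similarity_to_points(sim_map, k_pos=3, k_neg=3):
    """sim_map: (H, W) similarity map in [0, 1] for the target text.
    Returns (points, labels) in the (x, y) / {1: positive, 0: negative}
    convention used by SAM-style point prompts."""
    h, w = sim_map.shape
    flat = sim_map.reshape(-1)
    pos_idx = np.argsort(flat)[-k_pos:]    # highest-similarity pixels -> foreground hints
    neg_idx = np.argsort(flat)[:k_neg]     # lowest-similarity pixels -> background hints
    idx = np.concatenate([pos_idx, neg_idx])
    ys, xs = np.unravel_index(idx, (h, w))
    points = np.stack([xs, ys], axis=-1)   # (x, y) coordinates
    labels = np.array([1] * k_pos + [0] * k_neg)
    return points, labels

# usage idea with the demo's similarity_map (hypothetical): pass the result to a SAM
# predictor, e.g. predictor.predict(point_coords=points, point_labels=labels)
```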
Qualitative Comparison
- Original CLIP: emphasizes the background + noisy
- CLIP Surgery: sharp, object-focused heatmaps
- → Clear improvement over Grad-CAM, Bi-Modal, and gScoreCAM
Ablation Study
- Architecture Surgery (i) only → mSC +47.88%
- Adding Feature Surgery (ii) → a further +3.17%
- Without dual paths → collapse occurs, proving they are essential.
Conclusion
- CLIP Surgery solves CLIP's fundamental explainability issues (opposite visualization, noisy activation).
- A training-free approach strengthens CAM-based interpretation.
- Directly applicable to downstream tasks like Semantic Segmentation, Multi-label Recognition, and Interactive Segmentation.
- Provides key insights for understanding CLIP's internals and guiding future model improvements.