🔎 CLIP Surgery: A Closer Look at the Explainability of Contrastive Language-Image Pre-training
🔎 (English) CLIP Surgery: Enhancing Explainability by Operating on CLIP!
- Title: A Closer Look at the Explainability of Contrastive Language-Image Pre-training (CLIP Surgery)
- Journal: Pattern Recognition (2025)
- Code: GitHub – CLIP Surgery
- Keywords:
CLIP, Explainability, CAM, Vision-Language, Open-Vocabulary
- Summary: CLIP is a powerful vision-language model, but it often focuses on the background instead of the foreground and suffers from noisy activation. To address this, the authors propose Architecture Surgery and Feature Surgery, significantly improving explainability!
🚀 Key Summary of CLIP Surgery
One-liner: “Without extra training, only structural and feature surgeries enhance CLIP’s explainability!”
1) Proving CLIP’s Issues: Found problems of inconsistency in self-attention and redundant features.
2) Feature Surgery: Successfully removed redundant features and suppressed unnecessary noisy activation → produced clean CAMs!!
3) Training-Free Explainability: No fine-tuning required, explainability secured with original CLIP → versatile applications!!
🔍 Research Background
- CAM, Grad-CAM: Effective for CNN/ViT but fail on CLIP.
On CLIP they are noisy and produce opposite visualizations (highlighting the background instead of the object); in other words, localization fails!
- Why do they fail in CLIP?
- Because self-attention links inconsistent semantic regions, and redundant features emphasize background rather than foreground.
a. Why does self-attention link inconsistent semantic regions?
a-1. CLIP was only trained for global image–text matching, so attention didn’t need to focus precisely on object interiors.
a-2. CLIP's Query, Key, and Value projections use different (heterologous) parameters, so the Q/K relation connects inconsistent semantic regions:
A_raw = σ(s · QK^⊤)V (heterologous parameters)
A_con = σ(s · VV^⊤)V (homogeneous parameters)
b. Why do redundant features cause noise?
b-1. CLIP trains on many categories at once, so shared features (e.g., “sky”, “grass”, “road”) frequently appear.
b-2. These generic features are often in the background, so self-attention is easily pulled toward them, causing noisy activation.
- Alignment-based approaches exist but require extra models, layers, or fine-tuning (not training-free).
- ECLIP: Realigns CLIP features with segmentation masks using self-supervision.
- RCLIP: Uses bounding box annotations to refine CLIP’s image–text features per object.
- Both require retraining (fine-tuning).
🧱 CLIP Surgery Architecture
i) Architecture Surgery (Fixing Structural Issues)
- Raw self-attention (i-1) connects inconsistent semantic regions.
- Consistent self-attention (i-2) prevents unnecessary background emphasis.
mFSR measures how much self-attention focuses on the foreground (object).
i-1) raw self-attention: A_raw = σ(s · QK^⊤)V
i-2) consistent self-attention: A_con = σ(s · VV^⊤)V
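To see concretely why the V–V relation is called "consistent" while the Q–K relation is not, here is a tiny check with random toy tensors (hypothetical values, not from the paper): VV^⊤ is symmetric, so token i relates to token j exactly as j relates to i, whereas QK^⊤ is generally asymmetric.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)                       # 4 toy tokens, dim 8 (hypothetical)
Wq, Wk, Wv = (torch.randn(8, 8) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scale = 8 ** -0.5

raw = scale * q @ k.T                       # heterologous: Q-K relation
con = scale * v @ v.T                       # homogeneous: V-V relation

print((raw - raw.T).abs().max())            # clearly non-zero -> asymmetric, inconsistent relations
print((con - con.T).abs().max())            # ~0 -> symmetric, consistent relations
A_raw = raw.softmax(-1) @ v                 # σ(s·QK^⊤)V
A_con = con.softmax(-1) @ v                 # σ(s·VV^⊤)V
```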
- Code example: the Transformer Attention.forward pass:
```python
# i-1) Raw self-attention (original path) vs. i-2) consistent self-attention (new path)
def forward(self, x):
    B, N, C = x.shape
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]

    # i-1) original self-attention for the original path: QK^T
    attn_ori = (q @ k.transpose(-2, -1)) * self.scale
    attn_ori = attn_ori.softmax(dim=-1)
    attn_ori = self.attn_drop(attn_ori)

    # i-2) consistent self-attention: replace q & k by v, i.e. VV^T
    k = v
    q = k
    attn = (q @ k.transpose(-2, -1)) * self.scale
    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    # return both paths: x_ori keeps CLIP's original embedding, x feeds the CAM
    x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
    x = (attn @ v).transpose(1, 2).reshape(B, N, C)  # clip_surgery
    x = self.proj_drop(self.proj(x))
    x_ori = self.proj_drop(self.proj(x_ori))
    return [x, x_ori]
```
- Dual paths: one path keeps CLIP's original embeddings, the other is used for CAM generation; the FFN is skipped in the new path to avoid its negative effect.
- In the paper's figure, the FFN inside the CLIP Transformer focuses on the wrong regions, and the early self-attention blocks are also inaccurate.
- Thus, the new path applies only (consistent) self-attention with the FFN skipped, while the original path is kept intact to preserve CLIP's embeddings.
- Code example: the Transformer layer ResidualAttentionBlock.forward:
```python
def forward(self, x):
    # dual paths for blocks deeper than "d"
    if isinstance(self.attn, Attention):
        if isinstance(x, list):
            x, x_ori = x  # x_ori = original path, x = new path
            x_res = self.attention(self.ln_1(x_ori))
            x_res, x_ori_res = x_res  # x_res = consistent self-attention, x_ori_res = raw self-attention
            x_ori += x_ori_res
            x_ori = x_ori + self.mlp(self.ln_2(x_ori))  # original path adds the FFN
            x += x_res  # new path adds only the consistent self-attention (FFN skipped)
            return [x, x_ori]
```
ii) Feature Surgery (Fixing Representational Issues)
- CLIP learns many categories at once, leading to redundant shared features.
- Example: even when the target is "dog", the embeddings of cat, sky, sea, airplane, etc. still overlap with it.
A small L1 distance between the positive (target) and the empty (non-target) features means they largely overlap → the redundancy problem.
- Code: clip_feature_surgery estimates the shared redundant_feats and subtracts it from every feature.
```python
# in demo.py: model, image, cv2_img and device are prepared earlier; define all_texts and target_texts
import torch
import cv2
from matplotlib import pyplot as plt
import clip  # the CLIP Surgery repo's clip module

all_texts = [...]        # candidate class names: the target plus "empty" background classes
target_texts = ['dog']

with torch.no_grad():
    # extract and L2-normalize image features
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    # prompt-ensembled text features
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)
    # feature surgery, then a similarity map from image tokens with min-max norm and resize -> B,H,W,N
    similarity = clip.clip_feature_surgery(image_features, text_features)
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])

# draw a heatmap only for the target classes
for b in range(similarity_map.shape[0]):
    for n in range(similarity_map.shape[-1]):
        if all_texts[n] not in target_texts:
            continue
        vis = (similarity_map[b, :, :, n].cpu().numpy() * 255).astype('uint8')
        vis = cv2.applyColorMap(vis, cv2.COLORMAP_JET)
        vis = cv2_img * 0.4 + vis * 0.6
        vis = cv2.cvtColor(vis.astype('uint8'), cv2.COLOR_BGR2RGB)
        print('CLIP:', all_texts[n])
        plt.imshow(vis)
        plt.show()
```
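get_similarity_map itself is not shown above; per the comment in the demo, it min-max normalizes the token-to-class similarities, reshapes the patch tokens back to a 2D grid, and resizes to the image resolution. A minimal sketch of that logic (assuming a square patch grid; not necessarily the repo's exact code):

```python
import torch
import torch.nn.functional as F

def get_similarity_map_sketch(sm, shape):
    """sm: [B, N, C] token-to-class similarities (CLS already removed); shape: (H, W)."""
    # min-max normalize each class channel over the tokens
    sm = (sm - sm.min(1, keepdim=True)[0]) / (sm.max(1, keepdim=True)[0] - sm.min(1, keepdim=True)[0] + 1e-6)
    # tokens -> square patch grid, classes moved to the channel dim
    side = int(sm.shape[1] ** 0.5)
    sm = sm.reshape(sm.shape[0], side, side, -1).permute(0, 3, 1, 2)
    # upsample to the original image resolution, back to B,H,W,N
    sm = F.interpolate(sm, shape, mode='bilinear', align_corners=False)
    return sm.permute(0, 2, 3, 1)
```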
```python
# clip.py: clip_feature_surgery implementation
def clip_feature_surgery(image_features, text_features, redundant_feats=None, t=2):
    if redundant_feats is not None:
        # pre-computed redundant features: just subtract and project
        similarity = image_features @ (text_features - redundant_feats).t()
    else:
        # weights to restrain the influence of obvious classes on others (CLS token vs. all texts)
        prob = image_features[:, :1, :] @ text_features.t()
        prob = (prob * 2).softmax(-1)
        w = prob / prob.mean(-1, keepdim=True)

        # element-wise multiplied features: [B, N_image_tokens, N_texts, C]
        b, n_t, n_i, c = image_features.shape[0], text_features.shape[0], image_features.shape[1], image_features.shape[2]
        feats = image_features.reshape(b, n_i, 1, c) * text_features.reshape(1, 1, n_t, c)
        feats *= w.reshape(1, 1, n_t, 1)
        redundant_feats = feats.mean(2, keepdim=True)  # along the class dim
        feats = feats - redundant_feats

        # sum the element-wise multiplied features as cosine similarity
        similarity = feats.sum(-1)
    return similarity
```
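To build intuition for why subtracting the mean (redundant) component suppresses background activation, here is a toy example with hypothetical numbers (not taken from the paper):

```python
import torch

# three toy "text features" sharing a large common (background-like) component
common = torch.tensor([1.0, 1.0, 0.0])
dog = common + torch.tensor([0.0, 0.0, 1.0])
cat = common + torch.tensor([0.0, 0.0, -0.5])
sky = common + torch.tensor([0.5, 0.0, 0.0])
texts = torch.stack([dog, cat, sky])

# a toy background image token that mostly contains the shared component
patch = common
print(patch @ texts.t())                    # all classes fire -> noisy activation

redundant = texts.mean(0, keepdim=True)     # estimate of the shared/redundant feature
print(patch @ (texts - redundant).t())      # background response collapses toward zero
```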
🧪 Experimental Results
🎯 Explainability Benchmarks
- On VOC 2012, COCO, PascalContext:
- mIoU +22–36% improvement
- mSC +48–65% improvement
🎯 Open-Vocabulary Tasks
- Semantic Segmentation: best training-free method (PascalContext mIoU 29.3%)
- Multi-label Recognition: +11.61% mAP over CLIP on NUS-Wide
- Interactive Segmentation: replaces manual labels by converting text into point prompts for SAM (see the sketch after this list).
- Multimodal Visualization: interprets CLIP's training itself.
The [end] token is the text token activated most often, and non-object words like "in", ".", and "of" are also highly active, which suggests redundant tokens in CLIP's vocabulary.
- Provides ideas for improving CLIP training in the future.
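For the interactive-segmentation use case, the similarity map has to be turned into point prompts for SAM. The helper below is a hypothetical sketch of one plausible conversion (top-k confident peaks as positive points), not the repo's exact function:

```python
import torch

def similarity_map_to_points_sketch(sm, topk=8, rel_threshold=0.8):
    """sm: [H, W] similarity map for one class.
    Returns (x, y) pixel coordinates and SAM-style labels (1 = positive point)."""
    h, w = sm.shape
    flat = sm.flatten()
    vals, idx = flat.topk(topk)                  # strongest activations
    keep = vals >= rel_threshold * flat.max()    # keep only confident peaks
    points = [(int(i % w), int(i // w)) for i in idx[keep]]
    labels = [1] * len(points)
    return points, labels

# usage: points, labels = similarity_map_to_points_sketch(similarity_map[0, :, :, n])
# then feed points/labels to SAM's predictor as point prompts.
```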
👀 Qualitative Comparison
- Original CLIP: emphasizes background + noisy
- CLIP Surgery: sharp, object-focused heatmaps
- → Clear improvement vs Grad-CAM, Bi-Modal, gScoreCAM
🧪 Ablation Study
- Only Architecture Surgery (i) → mSC +47.88%
- Add Feature Surgery (ii) → extra +3.17%
- Without dual paths → collapse occurs, proving it’s essential.
✅ Conclusion
- CLIP Surgery solves CLIP’s fundamental explainability issues (opposite visualization, noisy activation).
- A training-free approach strengthens CAM-based interpretation.
- Directly applicable to downstream tasks like Semantic Segmentation, Multi-label Recognition, Interactive Segmentation.
- Provides key insights for understanding CLIP internals and guiding future model improvements.
🔎 (Korean) CLIP Surgery: Enhancing Explainability by Operating on CLIP!
- Title: A Closer Look at the Explainability of Contrastive Language-Image Pre-training (CLIP Surgery)
- Journal: Pattern Recognition (2025)
- Code: GitHub – CLIP Surgery
- Keywords: CLIP, Explainability, CAM, Vision-Language, Open-Vocabulary
- Summary: CLIP is a powerful vision-language model, but it tends to focus on the background instead of the foreground and suffers from noisy activation. To address this, the paper proposes a framework that applies Architecture Surgery and Feature Surgery to greatly improve explainability!
🚀 Key Summary of CLIP Surgery
One-liner: "Without any extra training, structural and feature surgery alone strengthen CLIP's explainability!"
1) Proving CLIP's issues: identifies the inconsistency of the existing self-attention and the problem of redundant features.
2) Feature Surgery: removes redundant features and suppresses unnecessary noisy activation, producing clean CAMs!!
3) Training-Free Explainability: no fine-tuning needed; explainability is obtained from the original CLIP as-is, so it can be applied broadly!!
🔍 Research Background
- CAM, Grad-CAM: effective for CNN/ViT but not applicable to CLIP.
On CLIP they are noisy and show opposite visualizations, i.e., localization breaks down!!
- So why do they fail on CLIP?
- Because the self-attention connects inconsistent semantic regions, and redundant features introduce noise that emphasizes the background instead of the foreground.
a. Why does self-attention connect inconsistent semantic regions?
a-1. CLIP was trained only for global image–text matching, so attention never needed to focus precisely inside objects!!
a-2. CLIP's Query, Key, and Value parameters differ from each other (heterologous parameters), so the relations built from Query/Key connect inconsistent semantic regions:
A_raw = σ(s · QK^⊤)V (heterologous parameters)
A_con = σ(s · VV^⊤)V (homogeneous parameters)
b. Why do redundant features cause noise?
b-1. CLIP learns many categories at once, so features shared across classes (e.g., "sky", "grass", "road") appear frequently.
b-2. These generic features mostly lie in the background, so self-attention is easily pulled toward the background and noisy activation occurs.
- Alignment-based methods exist, but they require an extra model, extra layers, or fine-tuning (not training-free)!
- ECLIP: re-aligns CLIP features with segmentation masks via self-supervision.
- Since the original CLIP cannot localize directly, it compensates by additionally training on mask information.
- RCLIP: uses bounding-box annotations to refine CLIP's image–text features per object.
- In the end, both re-train (fine-tune) CLIP.
🧱 CLIP Surgery Architecture
i) Architecture Surgery (fixing the structural problem)
- Raw self-attention (i-1) has the problem of connecting inconsistent semantic regions,
- so consistent self-attention (i-2) is used to prevent unnecessary background emphasis.
i-1) raw self-attention: A_raw = σ(s · QK^⊤)V
i-2) consistent self-attention: A_con = σ(s · VV^⊤)V
- In code, this lives in the Transformer's Attention.forward:
```python
# i-1) Raw self-attention (original path) vs. i-2) consistent self-attention (new path)
def forward(self, x):
    B, N, C = x.shape
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]

    # i-1) original self-attention for the original path
    attn_ori = (q @ k.transpose(-2, -1)) * self.scale
    attn_ori = attn_ori.softmax(dim=-1)
    attn_ori = self.attn_drop(attn_ori)

    # i-2) consistent self-attention: replace q & k by v
    k = v
    q = k
    attn = (q @ k.transpose(-2, -1)) * self.scale
    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    # finish: return both paths so that both can be used
    x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
    x = (attn @ v).transpose(1, 2).reshape(B, N, C)  # clip_surgery
    # x = v.transpose(1, 2).reshape(B, N, C)  # mask_clip
    x = self.proj_drop(self.proj(x))
    x_ori = self.proj_drop(self.proj(x_ori))
    return [x, x_ori]
```
With dual paths, two sets of features are extracted, one for the CLIP embedding and one for building the CAM image, while minimizing the FFN's negative effect.
- As the figure above shows, inside the CLIP Transformer the FFN attends to the wrong places, and the early self-attention blocks are also inaccurate!
- Therefore, the new path applies only self-attention and skips the FFN, while the original path is kept to preserve CLIP's original embedding.
- In code, this is the Transformer layer ResidualAttentionBlock.forward:
```python
def forward(self, x):
    # dual paths for blocks deeper than "d"
    if isinstance(self.attn, Attention):
        if isinstance(x, list):
            x, x_ori = x  # x_ori is the original path, x is the new path!
            x_res = self.attention(self.ln_1(x_ori))
            x_res, x_ori_res = x_res  # x_res = consistent self-attention, x_ori_res = raw self-attention
            x_ori += x_ori_res  # the original path adds the raw self-attention (x_ori_res)
            x_ori = x_ori + self.mlp(self.ln_2(x_ori))  # and then adds the FFN on top!!
            x += x_res  # the new path adds only the consistent self-attention
            return [x, x_ori]
```
ii) Feature Surgery (fixing the representational problem)
- CLIP learns many categories at once, so it has the problem of many features being shared across classes.
Even when the target is "dog", the embeddings of other texts such as cat, sky, sea, airplane are also computed, and the overlapping part is removed. A small L1 distance means they are similar, i.e., the positive and the empty classes end up similar, which is exactly the redundancy problem!!
- In code: clip_feature_surgery builds redundant_feats from the part shared across all classes and subtracts it from each class's feats.
In demo.py, first declare all_texts (the candidate classes, including the "empty" ones) and target_texts, then:
```python
all_texts = ['airplane', 'bag', 'bed', 'bedclothes', 'bench', 'bicycle', 'bird', 'boat', 'book', 'bottle', 'building', 'bus', 'cabinet', 'car', 'cat', 'ceiling', 'chair', 'cloth', 'computer', 'cow', 'cup', 'curtain', 'dog', 'door', 'fence', 'floor', 'flower', 'food', 'grass', 'ground', 'horse', 'keyboard', 'light', 'motorbike', 'mountain', 'mouse', 'person', 'plate', 'platform', 'potted plant', 'road', 'rock', 'sheep', 'shelves', 'sidewalk', 'sign', 'sky', 'snow', 'sofa', 'table', 'track', 'train', 'tree', 'truck', 'tv monitor', 'wall', 'water', 'window', 'wood']
target_texts = ['dog']

with torch.no_grad():
    # Extract and normalize image features
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Prompt ensemble for text features with normalization
    text_features = clip.encode_text_with_prompt_ensemble(model, all_texts, device)

    # clip_feature_surgery happens here!!
    similarity = clip.clip_feature_surgery(image_features, text_features)
    # Similarity map from image tokens with min-max norm and resize, B,H,W,N
    similarity_map = clip.get_similarity_map(similarity[:, 1:, :], cv2_img.shape[:2])

# draw a heatmap only for the target classes
for b in range(similarity_map.shape[0]):
    for n in range(similarity_map.shape[-1]):
        if all_texts[n] not in target_texts:
            continue
        vis = (similarity_map[b, :, :, n].cpu().numpy() * 255).astype('uint8')
        vis = cv2.applyColorMap(vis, cv2.COLORMAP_JET)
        vis = cv2_img * 0.4 + vis * 0.6
        vis = cv2.cvtColor(vis.astype('uint8'), cv2.COLOR_BGR2RGB)
        print('CLIP:', all_texts[n])
        plt.imshow(vis)
        plt.show()
```
```python
# clip.py: clip_feature_surgery builds redundant_feats from the part shared across all classes
# and subtracts it from each class's feats
def clip_feature_surgery(image_features, text_features, redundant_feats=None, t=2):
    if redundant_feats is not None:
        similarity = image_features @ (text_features - redundant_feats).t()
    else:
        # weights to restrain influence of obvious classes on others
        prob = image_features[:, :1, :] @ text_features.t()
        prob = (prob * 2).softmax(-1)
        w = prob / prob.mean(-1, keepdim=True)

        # element-wise multiplied features
        b, n_t, n_i, c = image_features.shape[0], text_features.shape[0], image_features.shape[1], image_features.shape[2]
        feats = image_features.reshape(b, n_i, 1, c) * text_features.reshape(1, 1, n_t, c)
        feats *= w.reshape(1, 1, n_t, 1)
        redundant_feats = feats.mean(2, keepdim=True)  # along the class dim
        feats = feats - redundant_feats

        # sum the element-wise multiplied features as cosine similarity
        similarity = feats.sum(-1)
    return similarity
```
🧪 Experimental Results
🎯 Explainability Benchmarks
- On VOC 2012, COCO, and PascalContext:
- mIoU improved by +22–36%
- mSC improved by +48–65%
🎯 Open-Vocabulary Tasks
- Semantic Segmentation: best performance among training-free methods (e.g., PascalContext mIoU 29.3%)
- Multi-label Recognition: +11.61% mAP over CLIP on NUS-Wide
- Interactive Segmentation: replaces manual labels by converting text into points for SAM
- Multimodal Visualization: makes it possible to interpret CLIP's training process itself
The [end] token is the most frequently activated text token, and non-object words such as "in", ".", "of" also respond strongly!! This suggests that redundant tokens exist in CLIP's vocabulary,
- providing ideas for improving CLIP training in the future!!
👀 Qualitative Comparison: it does well!
- Original CLIP: emphasizes the background + lots of noise
- CLIP Surgery: produces sharp, object-centered heatmaps
- → Clearly improved visualization quality compared with Grad-CAM, Bi-Modal, and gScoreCAM
🧪 Ablation Study
- Architecture Surgery (i) alone → mSC +47.88%
- Adding Feature Surgery (ii) → an extra +3.17% gain
- Without dual paths, collapse occurs, confirming that they are an essential module
✅ Conclusion
- CLIP Surgery solves CLIP's fundamental explainability problems (opposite visualization, noisy activation).
- A training-free approach, with no extra learning, strengthens CAM-based interpretation.
- It can be used directly in various downstream tasks such as Semantic Segmentation, Multi-label Recognition, and Interactive Segmentation.
- It provides important insights for understanding CLIP's internal mechanism and for guiding future model improvements.