GEM: Grounding Everything in Vision-Language Transformers
GEM: Unlocking the Latent Localization Ability of VLMs!
- Title: Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
- Conference: CVPR 2024
- Code/Checkpoints: GitHub – GEM
- Keywords: Training-Free, Grounding, Vision-Language Transformer, Self-Self Attention, Zero-Shot
- Summary: Feels like an extended version of CLIP Surgery! Proposes GEM, a framework that leverages the inherent attention structure of pretrained Vision-Language Transformers (VLMs) to perform object localization and segmentation in a training-free manner!
GEM Key Summary
One-liner: CLIP Surgery + (1) Attention Expansion + (2) Regularization
1) Self-Self Attention Expansion
- CLIP Surgery only used value–value (v–v) attention
- GEM extends this to query–query (q–q) and key–key (k–k) → utilizes full self–self attention
2) Regularization
- CLIP Surgery had no normalization concept
- GEM introduces three components for more stable and generalized localization:
i) Adaptive temperature: adaptively adjusts softmax temperature for each dataset/model
ii) L2 normalization: removes influence from token magnitude differences
iii) Iterative self–self attention: repeats the clustering multiple times for reinforcement
3) Training-Free Grounding with Zero-Shot Localization & Segmentation
- Directly extracts localization ability from pretrained VLMs
- Achieves performance comparable to fine-tuned detectors
- Open-vocabulary grounding without additional training
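Putting 1) and 2) together, one iteration of GEM's self–self attention can be written (my notation; single head, with dropout and the output projection omitted) for a token matrix $X \in \{Q, K, V\}$ as

$$
\hat{X} = \frac{X}{\lVert X \rVert_2}, \qquad
\mathrm{SSA}(X) = \operatorname{softmax}\!\big(\tau\, \hat{X}\hat{X}^{\top}\big)\,\hat{X},
$$

and after the final iteration each branch is applied to $V$ and the three branches are averaged:

$$
\mathrm{GEM}(Q, K, V) = \frac{1}{3}\sum_{X \in \{Q, K, V\}} \operatorname{softmax}\!\big(\tau\, \widehat{\mathrm{SSA}(X)}\, \widehat{\mathrm{SSA}(X)}^{\top}\big)\, V .
$$

Here $\tau$ is the adaptive inverse temperature; in the reference code shown in the Architecture section it is set to the mean token norm scaled by $1/\sqrt{d_{\text{head}}}$.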
Flow of Existing Research
1. Localization-first approaches
- Idea: first detect regions or masks, then label them using VL models
- Examples:
- OpenSeg: fine-tuned with class-agnostic masks + image-text pairs
- OVSeg: segmentation model + CLIP for mask classification
- MaskCLIP(3): mask proposal network + CLIP encoder
- GroundingSAM: GroundingDINO (detector) + SAM (masking)
2. Modifying VL model architecture/training
- Idea: alter ViT to encourage localization properties
- Examples:
- SegCLIP, GroupViT: insert grouping blocks
- ViL-Seg, OVSegmentor: clustering / Slot Attention
- ReCo: retrieval-based fine supervision
- PACL: add a decoder with grounding loss on top of CLIP
3. Training-free adaptation
- Idea: adapt pretrained VL models for localization without training
- Examples:
- MaskCLIP: remove final MLP, use value projection
- CLIP Surgery: add surgery pathway to ViT backbone (v–v attention with residual)
➡️ GEM's core concept: extending the training-free CLIP Surgery approach!
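For reference, here is a minimal sketch (my own simplification, not the official implementation) of the value–value attention that CLIP Surgery runs in its parallel pathway; GEM generalizes this single v–v branch to the full q–q / k–k / v–v ensemble described above:

```python
import torch

def vv_attention(v: torch.Tensor, scale: float) -> torch.Tensor:
    """CLIP Surgery-style value-value attention (illustrative sketch).

    v:     (B, heads, N, d) value tokens of one transformer block
    scale: softmax scaling, typically 1 / sqrt(d)
    """
    attn = (v @ v.transpose(-2, -1)) * scale   # value tokens attend to themselves
    return attn.softmax(dim=-1) @ v            # no query/key projections involved
```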
GEM Architecture
Easier to understand through code than images!
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
                 proj_drop=0., ss_attn_iter=1, ss_attn_temp=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.ss_attn_iter = ss_attn_iter
        self.ss_attn_temp = ss_attn_temp
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, attn_bias=None, prev_attn=None):
        x = x.transpose(0, 1)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        self.v_values = v

        # original self-attention for the original path
        attn_ori_return = (q @ k.transpose(-2, -1)) * self.scale
        attn_ori = attn_ori_return.softmax(dim=-1)
        attn_ori = self.attn_drop(attn_ori)
        x_ori = (attn_ori @ v).transpose(1, 2).reshape(B, N, C)
        x_ori = self.proj_drop(self.proj(x_ori))

        # GEM: self-self attention over v, k, and q
        xs1 = v
        xs2 = k
        xs3 = q

        # >>> i) Adaptive temperature: `inv_temp`
        if self.ss_attn_temp is None:
            pre_norm = torch.norm(x, dim=-1).mean(dim=-1, keepdim=True).unsqueeze(1).unsqueeze(-1)
            inv_temp = pre_norm * self.scale
        else:
            inv_temp = self.ss_attn_temp

        # >>> iii) Iterative self-self attention
        for it in range(self.ss_attn_iter):
            # >>> ii) L2 normalization
            xs1 = F.normalize(xs1, dim=-1)
            xs2 = F.normalize(xs2, dim=-1)
            xs3 = F.normalize(xs3, dim=-1)

            attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
            attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
            attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

            attn1 = (attn_return1).softmax(dim=-1)
            attn2 = (attn_return2).softmax(dim=-1)
            attn3 = (attn_return3).softmax(dim=-1)

            xs1 = attn1 @ xs1
            xs2 = attn2 @ xs2
            xs3 = attn3 @ xs3

        # Assignment to V
        xs1 = F.normalize(xs1, dim=-1)
        xs2 = F.normalize(xs2, dim=-1)
        xs3 = F.normalize(xs3, dim=-1)

        attn_return1 = (xs1 @ xs1.transpose(-2, -1)) * inv_temp
        attn_return2 = (xs2 @ xs2.transpose(-2, -1)) * inv_temp
        attn_return3 = (xs3 @ xs3.transpose(-2, -1)) * inv_temp

        attn1 = (attn_return1).softmax(dim=-1)
        attn2 = (attn_return2).softmax(dim=-1)
        attn3 = (attn_return3).softmax(dim=-1)

        xs1 = attn1 @ v
        xs2 = attn2 @ v
        xs3 = attn3 @ v

        # >>> iv) qkv ensemble
        xs = (xs1 + xs2 + xs3) / 3

        x = xs.transpose(1, 2).reshape(B, N, C)
        x = self.proj_drop(self.proj(x))
        return [x.transpose(0, 1), x_ori.transpose(0, 1)]
```
1) Self-Self Attention Expansion
- As seen in the code: xs1 = v–v, xs2 = k–k, xs3 = q–q
2) Regularization
i) Adaptive temperature
ii) L2 normalization
iii) Iterative self–self attention
iv) qkv-Ensemble (averaging the three branches; see the usage sketch below)
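For intuition, here is a minimal smoke test of the SelfSelfAttention block above (shapes follow CLIP's ViT convention of tokens-first input; the numbers are illustrative and the weights are randomly initialized here, whereas in GEM they come from the pretrained CLIP block):

```python
import torch

dim, num_heads = 768, 12
attn = SelfSelfAttention(dim, num_heads=num_heads, ss_attn_iter=1)

tokens = torch.randn(197, 1, dim)   # 1 CLS token + 14x14 patch tokens, batch size 1
x_gem, x_ori = attn(tokens)         # GEM pathway and the original attention pathway
print(x_gem.shape, x_ori.shape)     # both torch.Size([197, 1, 768])
```

In GEM, the modified pathway output (x_gem) is the one used for localization, while x_ori mirrors the unmodified CLIP block, following the dual-path design inherited from CLIP Surgery.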
Experiments & Results
Segmentation & Localization Benchmarks
- On complex datasets like PascalContext, ADE20K:
- Significantly better than previous training-free methods
- Comparable or superior to fine-tuned approaches
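As context for these benchmark numbers, training-free open-vocabulary segmentation with a model like GEM boils down to comparing patch features against text embeddings. A minimal sketch of that step (my illustration, not the authors' exact evaluation code):

```python
import torch
import torch.nn.functional as F

def patch_text_segmentation(patch_feats: torch.Tensor,
                            text_feats: torch.Tensor,
                            grid_hw: tuple[int, int]) -> torch.Tensor:
    """Assign each patch to its most similar class prompt (illustrative only).

    patch_feats: (N, D) projected patch tokens from the GEM pathway
    text_feats:  (K, D) CLIP text embeddings of K class prompts
    grid_hw:     (H, W) patch grid with H * W == N
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim = patch_feats @ text_feats.t()      # (N, K) cosine similarities
    labels = sim.argmax(dim=-1)             # per-patch class index
    return labels.reshape(grid_hw)          # coarse (H, W) segmentation map
```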
Zero-Shot Point Prediction (OpenImages V7)
- First training-free SOTA
- Demonstrates localization without LLM/VLM hybrid models
- Downside: inference FPS is quite slow
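One plausible way (my sketch, not necessarily the paper's exact protocol) to turn such a patch–text similarity map into a point prediction for a single text query:

```python
import torch
import torch.nn.functional as F

def predict_point(sim_map: torch.Tensor, image_hw: tuple[int, int]) -> tuple[int, int]:
    """Pick the most likely pixel for one text prompt (illustrative sketch).

    sim_map:  (H, W) patch-level cosine similarities for the prompt
    image_hw: (H_img, W_img) target image resolution
    """
    up = F.interpolate(sim_map[None, None], size=image_hw,
                       mode="bilinear", align_corners=False)[0, 0]
    idx = int(up.flatten().argmax())
    y, x = divmod(idx, image_hw[1])
    return x, y   # predicted point as (column, row)
```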
Qualitative Results
- Methods trained with localization supervision (GroundingSAM, OVSeg)
- Strength: high-quality masks when the object is correctly identified (e.g., Cat, Squirrel, Jet Ski)
- Weakness: fail to detect entities absent from their training data (e.g., Boxer, Violin)
- Cause: reliance on handcrafted segmentation annotations → limited scope
- Segmentation-specialized training methods (GroupViT, SegCLIP)
- Strength: accurate for common objects (e.g., Cat, Squirrel, Lizard)
- Weakness: fail on rare objects (e.g., Jet Ski, Logo, Flag)
- Cause: limited curated vocabulary → reduced diversity
- Training-free methods (MaskCLIP, CLIPSurgery, GEM)
- Strength: leverage the millions of image-text pairs from VLM pretraining → recognize diverse entities
- Weakness: masks less sharp than GroundingSAM
- GEM's extra achievements:
- Sharper segmentation than other training-free approaches (clearer contours, fewer holes)
- Detects objects missed by MaskCLIP & CLIP Surgery (e.g., Logo)
- Works not only with CLIP but with other VLM backbones as well
Failure cases show strong dependence on text prompts!
Ablation Analysis
Comparison with CLIP Surgery: adding k–k and q–q attention, normalization, etc. each brings further improvements
Effect of normalization: clear benefit, but it needs a properly chosen inverse temperature 1/T
Effect of iterations:
- More iterations merge tokens into larger clusters → helpful on datasets with few classes (VOC)
- On complex datasets with many classes (Context), over-merging hurts, so fewer iterations work better
Conclusion
- GEM reveals the latent localization ability of Vision-Language Transformers
- Potential to replace fine-tuned detectors when combined with larger VLMs
- Introduces a new paradigm for open-world recognition, segmentation, and grounding!