🔎 VL-SAM: Training-Free Open-Ended Object Detection & Segmentation

Image

  • Title: Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
  • Conference: NeurIPS 2024
  • Keywords: Open-Ended Detection, Segmentation, SAM, VLM, Attention Map, Training-Free
  • Summary: By linking the attention map of a Vision-Language Model (VLM) to the prompt of Segment Anything Model (SAM), VL-SAM enables simultaneous detection and segmentation of unseen objects without additional training!

🚀 VL-SAM Key Summary

One-liner: "No labels, no training; just use VLM attention as prompts for SAM to detect and segment objects!"

Image

1) Training-Free Open-Ended Framework

  • Combines VLM + SAM without extra training
  • Uses the attention map as prompts (going beyond open-set: unlike open-set methods, it does not need the category words at all!)

2) Attention Map Aggregation & Flow

  • Aggregates multi-head, multi-layer attention from VLM for high-quality maps
  • Mitigates collapse due to causal mask with regularized attention flow

3) Prompt Generation & Iterative Refinement

  • Samples positive/negative points from attention
  • Feeds into SAM, then iteratively refines with feedback

4) Generalization & Modularity

  • Applicable to various VLMs (MiniGPT-4, LLaVA, etc.) and SAM variants (MobileSAM, etc.)

๐Ÿ” Background

📑 Vision-Language Models (VLMs)

  • Beyond GPT-3, LLaMA → emergence of VLMs:
    • BLIP-2: Q-Former module / aligns image-text embeddings with multiple pretraining losses
    • LLaMA-Adapter / LLaVA / MiniGPT: adapters or projection layers / aligns vision features into LLM space / combines large LLM with vision
    • CogVLM: introduces Visual Expert Modules / converts image features at transformer-head level
    • SPHINX: supports multi-vision tasks with various mixing techniques
    • CogAgent / LLaVA-Phi: defines VLMs as agents / supports multi-step reasoning / interactive and tool-use tasks
    • GPT-4V: strong generalization, handles corner cases and complex real-world scenarios
  • BUT: Localization ability is still weaker than models like SAM.
  • → Hence, the idea is to combine VLM and SAM in a training-free way to provide segmentation capability!

  • Extra note (Preliminaries):
    • Segment Anything Model (SAM): bbox/point prompt-based segmentation model, composed of image encoder, prompt encoder, mask decoder.
    • AR-based VLMs: Auto-Regressive, i.e., Transformer decoders with next-token prediction (e.g., GPT-4V, Qwen2.5VL).
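
As a quick reference for the interface VL-SAM relies on, here is a minimal sketch of prompting SAM with positive/negative points using Meta's segment-anything package (the checkpoint path and the dummy image are placeholders):

```python
# Minimal sketch: prompting SAM with point prompts (segment-anything package).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # replace with a real RGB image
predictor.set_image(image)

point_coords = np.array([[256, 200], [40, 40]])   # (x, y) pixel coordinates
point_labels = np.array([1, 0])                   # 1 = positive point, 0 = negative point

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,       # SAM returns several candidate masks
)
best_mask = masks[np.argmax(scores)]
```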

Object Detection Paradigms

Image

  • Open-Set: With CLIP, but still requires predefined categories (e.g., GLIP, GroundingDINO, SWORD, YOLO-World).
  • Open-Ended: Predict both object name + location with no predefined categories. GenerateU pioneered this, followed by DetCLIPv3. But requires large training data.
  • → VL-SAM: first training-free open-ended detection + segmentation.

🧱 VL-SAM Architecture

Image

  • Connects VLM (object recognition) with SAM (object localization).
  • [object recognition] Input image → VLM detects objects and produces attention map.
  • [object localization] Attention map → converted into point prompts for SAM → segmentation masks.

1) [object recognition] - Attention Map Generation (VLM)

  • Core idea: build SAM prompts directly from VLM attention!
    a. Ask VLM to list all objects (Tag2Text-like object list extraction)
    b. Save Q/K for all layers & heads during token generation
    c. Compute (Q \times K^T), apply the causal mask, Softmax → similarity matrix S
    d. Compute head-layer weights (W = \text{Mean}(\max(S, \text{dim}=1), \text{dim}=0))
    e. Apply W to S to get the corrected (S')
    f. Aggregate across layers → final attention flow
    g. Regularize attention flow (to prevent collapse in AR-based VLMs)

Finally: attention map is ready!
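
To make steps (c)–(f) concrete, here is a minimal NumPy sketch, under the assumption that the per-head, per-layer similarity matrices have already been collected into one tensor S; the layer-wise flow is chained with plain matrix products, and the regularization of step (g) is only noted in a comment rather than implemented.

```python
import numpy as np

def aggregate_attention(S: np.ndarray) -> np.ndarray:
    """Aggregate multi-head, multi-layer attention into a single map.

    S: similarity tensor of shape (N, N, H, L), i.e. the softmaxed Q·K^T of
       every head and layer collected in step (c). Returns an (N, N) map.
    """
    # Step (d): per-head/per-layer weight W = Mean(Max(S, dim=1), dim=0), shape (H, L).
    W = S.max(axis=1).mean(axis=0)

    # Step (e): weight S by W and average over the head axis -> S' of shape (N, N, L).
    S_prime = (S * W[None, None, :, :]).mean(axis=2)

    # Step (f): chain the per-layer maps into an attention "flow".
    # (Simplified: the paper additionally regularizes this product, step (g),
    #  to avoid the causal-mask collapse toward the top-left corner.)
    flow = S_prime[:, :, 0]
    for layer in range(1, S_prime.shape[-1]):
        flow = S_prime[:, :, layer] @ flow
    return flow

# Shape-level usage: 8 tokens, 4 heads, 2 layers of (already softmaxed) attention.
S = np.random.rand(8, 8, 4, 2)
S /= S.sum(axis=1, keepdims=True)       # make each query row sum to 1
attn_map = aggregate_attention(S)       # (8, 8)
```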


2) [object localization] - SAM Prompt Generation

  • Extract positive/negative points from attention map
    • Positive: top values above threshold
    • Negative: low values outside positive regions
  • Run SAM once, then refine by re-sampling points from the first result
  • Aggregate masks with NMS
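
A minimal sketch of this point-sampling step, assuming a single normalized 2-D attention map per object; the fixed threshold and the single positive/negative pair are simplifications of the paper's sampling strategy:

```python
import numpy as np

def sample_prompt_points(attn: np.ndarray, pos_thresh: float = 0.5):
    """Pick one positive and one negative point prompt from an attention map.

    attn: (H, W) attention map scaled to [0, 1].
    Returns ((x, y) positive, (x, y) negative) in (column, row) order for SAM.
    """
    pos_area = attn >= pos_thresh
    # Positive point: the maximum attention value inside the positive region.
    pos_idx = np.unravel_index(np.argmax(np.where(pos_area, attn, -np.inf)), attn.shape)
    # Negative point: the minimum attention value outside the positive region.
    neg_idx = np.unravel_index(np.argmin(np.where(pos_area, np.inf, attn)), attn.shape)
    return (pos_idx[1], pos_idx[0]), (neg_idx[1], neg_idx[0])

attn = np.random.rand(64, 64)
attn /= attn.max()                      # normalize so the peak is 1.0
pos_pt, neg_pt = sample_prompt_points(attn)
```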

3) Ensembles for Accuracy

  • Sub-image tiling for small objects
  • Multiple prompts: ask VLM multiple times and ensemble results
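
A small sketch of the sub-image ensemble idea: tile the image into overlapping crops, run the same pipeline on each crop, and shift the per-crop detections back to full-image coordinates before the final NMS (the 2×2 grid and 10% overlap are illustrative choices, not values from the paper):

```python
import numpy as np

def tile_image(image: np.ndarray, rows: int = 2, cols: int = 2, overlap: float = 0.1):
    """Yield (crop, x_offset, y_offset) overlapping tiles covering the image."""
    H, W = image.shape[:2]
    th, tw = H // rows, W // cols
    dy, dx = int(th * overlap), int(tw * overlap)
    for r in range(rows):
        for c in range(cols):
            y0, x0 = max(r * th - dy, 0), max(c * tw - dx, 0)
            y1, x1 = min((r + 1) * th + dy, H), min((c + 1) * tw + dx, W)
            yield image[y0:y1, x0:x1], x0, y0

# Each tile is run through the VLM + SAM pipeline; per-tile boxes/masks are then
# shifted by (x_offset, y_offset) back to full-image coordinates and merged with NMS.
```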

🔧 Evaluation & Results

  • CogVLM-17B + SAM (ViT-H)
  • Zero-shot evaluation on LVIS, CODA
  • For open-ended eval: embed generated object names with CLIP and match with dataset labels
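
A sketch of the label-matching step used for open-ended evaluation, assuming the generated names and the dataset category names have already been embedded with the CLIP text encoder; each generated name is assigned the dataset label with the highest cosine similarity:

```python
import numpy as np

def match_open_ended_names(gen_emb: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """Map generated object names to dataset labels via CLIP text embeddings.

    gen_emb:   (G, D) CLIP text embeddings of the generated (free-form) names.
    label_emb: (C, D) CLIP text embeddings of the dataset's category names.
    Returns, for each generated name, the index of its closest dataset label.
    """
    gen_emb = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    label_emb = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    sim = gen_emb @ label_emb.T          # cosine similarity, shape (G, C)
    return sim.argmax(axis=1)
```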

🎯 LVIS

Image

  • +3.4 APrare over GenerateU
  • Produces segmentation masks simultaneously, though not as strong as Mask R-CNN

🎯 CODA (corner-case detection)

Image

  • Achieves 40.1 mAR (vs 18.4 mAR for LLaVA-Grounding)
  • 74.1% of Oracle SAM upper bound (54.1 mAR)

🎯 Ablation Study

Image

  • Each component (attention generation, prompt sampling, iterative refinement, multi-scale/question ensemble) contributes
  • Without regularized attention flow → collapse
  • Prompt sampling strategy improves segmentation quality
  • Multi-scale + question ensemble maximizes corner-case detection

✅ Conclusion

  • VL-SAM: first training-free open-ended detection + segmentation framework
  • Innovative design: connect VLM attention → SAM prompts
  • Enables label-free, training-free recognition, with potential applications in autonomous driving, robotics, safety-critical systems

(Appendix) Easy Example of Computing W – 🧮 N=3 (cat, dog, truck)

Equation (1):
( W = \text{Mean}(\max(S, \text{dim}=1), \text{dim}=0) )

  • ( S \in \mathbb{R}^{N \times N \times H \times L} )
  • Here, we show a single head h in a single layer l: ( S^{h,l} \in \mathbb{R}^{N \times N} )

1) Single Head h, Single Layer l Example

Tokens: cat(1), dog(2), truck(3) → N=3
Similarity matrix (S^{h,l}):

\[S^{h,l} = \begin{bmatrix} 0.70 & 0.20 & 0.10 & \quad \text{(Query=cat)} \\ 0.10 & 0.60 & 0.30 & \quad \text{(Query=dog)} \\ 0.15 & 0.25 & 0.60 & \quad \text{(Query=truck)} \end{bmatrix}\]

(a) Max(S, dim=1)

  • cat row: 0.70
  • dog row: 0.60
  • truck row: 0.60
\[\max(S^{h,l}, \text{dim}=1) = \begin{bmatrix} 0.70 \\ 0.60 \\ 0.60 \end{bmatrix}\]

(b) Mean of those values

\(W_{h,l} = \frac{0.70 + 0.60 + 0.60}{3} = \mathbf{0.6333}\)


2) Same Layer l with Head 2 (H=2)

Example ( S^{h2,l} ):

\[S^{h2,l} = \begin{bmatrix} 0.40 & 0.30 & 0.30 \\ 0.35 & 0.35 & 0.30 \\ 0.34 & 0.33 & 0.33 \end{bmatrix}\]
  • Max by row → [0.40, 0.35, 0.34]
  • Mean → ( W_{h2,l} = 0.3633 )

So within layer l:

  • Head 1: (W_{h1,l} = 0.6333)
  • Head 2: (W_{h2,l} = 0.3633)
    → Head 1 is more "useful" and weighted higher.

3) Overall Shape & Broadcasting

  • Across all heads/layers:
    \(W \in \mathbb{R}^{1 \times 1 \times H \times L}\)

  • Next step (Equation 2):
    \(S' = \text{Mean}(S \odot W, \text{dim}=2)\)


4) Takeaway

  • Row-wise max (over Keys) = how strongly each Query focuses
  • Mean over Queries = head's overall importance (W_{h,l})
  • Apply W then average over heads = emphasizes good heads, yielding higher-quality (S')
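
For completeness, the worked example above can be reproduced mechanically; a small NumPy snippet (matching the shapes used in this appendix) for Equations (1) and (2):

```python
import numpy as np

# The two heads of one layer from the example above: S has shape (N, N, H, L) = (3, 3, 2, 1).
S1 = np.array([[0.70, 0.20, 0.10],
               [0.10, 0.60, 0.30],
               [0.15, 0.25, 0.60]])
S2 = np.array([[0.40, 0.30, 0.30],
               [0.35, 0.35, 0.30],
               [0.34, 0.33, 0.33]])
S = np.stack([S1, S2], axis=-1)[..., None]

# Equation (1): W = Mean(Max(S, dim=1), dim=0) -> broadcastable shape (1, 1, H, L)
W = S.max(axis=1).mean(axis=0)[None, None]
print(W.ravel())         # [0.63333333 0.36333333]  (Head 1, Head 2)

# Equation (2): S' = Mean(S * W, dim=2) -> head-weighted map of shape (N, N, L)
S_prime = (S * W).mean(axis=2)
print(S_prime.shape)     # (3, 3, 1)
```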

🔎 (Korean version) VL-SAM: Open-Ended Object Detection and Segmentation, Without Any Training!

Image

  • Title: Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
  • Conference: NeurIPS 2024
  • Key keywords: Open-Ended Detection, Segmentation, SAM, VLM, Attention Map, Training-Free
  • Summary: By connecting the attention map of a Vision-Language Model (VLM) to the prompt of the Segment Anything Model (SAM), VL-SAM detects and segments unseen objects simultaneously, with no additional training!

🚀 VL-SAM Key Summary

One-line summary: "No labels, no training needed; use the VLM's attention to call SAM and find the objects!"

Image

1) Training-Free Open-Ended Framework

  • Combines VLM + SAM with no extra training
  • Uses the attention map to generate prompts!! (a step beyond open-set: open-set methods must be given the category words, but this does not even need those!)

2) Attention Map Aggregation & Flow

  • Gathers the VLM's multi-head, multi-layer attention to build a high-quality map
  • Mitigates the collapse caused by the causal mask with a regularized attention flow

3) Prompt Generation & Iterative Refinement

  • Samples positive/negative points from the attention
  • Feeds them into SAM, then feeds the results back to progressively improve performance

4) Generalization & Modularity

  • Applicable to various VLMs (MiniGPT-4, LLaVA, etc.) and SAM variants (MobileSAM, etc.)

๐Ÿ” ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ํ๋ฆ„

📑 VLM - Vision-Language Model

  • Going beyond LLMs such as GPT-3 and LLaMA, Vision-Language Models have appeared!!
    • BLIP-2: introduces the Q-Former module / connects and fuses image and text embeddings / uses three alignment pretraining losses
    • LLaMA-Adapter / LLaVA / MiniGPT: use adapters (or projection layers) / align image features into the text embedding space / combine a large LLM with the vision modality
    • CogVLM: introduces Visual Expert Modules / converts and aligns image features at the transformer-head level / provides fine-grained per-head mapping
    • SPHINX: supports multiple vision tasks / uses various mixing techniques / applicable across diverse tasks
    • CogAgent / LLaVA-Phi: define the VLM as an agent (assistant) / perform multi-step reasoning / handle interactive and tool-use tasks
    • GPT-4V: strong generalization / can understand and reason about new or rare situations (corner cases) / handles complex real-world scenarios such as autonomous driving
  • BUT!! Its localization ability still falls short of purpose-built models like SAM!
  • So!! The idea is to connect the two models training-free and demonstrate segmentation capability!

    • ๋” ์•Œ์•„๋‘˜ ์‚ฌํ•ญ!!(Preliminary)
  • Segment Anything Model!!?
    • SAM์ด ๋Œ€ํ‘œ์ ์ด๋ฉฐ SAM์ด ์•„์ด๋””์–ด๋ฅผ ์–ป์€ MaskDINO๋„ ์žˆ์Œ! ใ…‡
    • bbox, point prompt ๊ธฐ๋ฐ˜์˜ Segmentation ๋ชจ๋ธ!!
    • 3๊ฐœ์˜ ์ฃผ์š” ์š”์†Œ๋กœ ๊ตฌ์„ฑ๋จ : ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”, ํ”„๋กฌํฌํŠธ ์ธ์ฝ”๋”, ๋งˆ์Šคํฌ ๋””์ฝ”๋”
    • ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”ฉ & ํ”„๋กฌํฌํŠธ ํ† ํฐ & ์ดˆ๊ธฐํ™” ๋งˆ์Šคํฌ ํ† ํฐ์„ ๋งˆ์Šคํฌ ๋””์ฝ”๋”์— ๋„ฃ์–ด์„œ!! ์ตœ์ ์˜ mask token์„ ๋งŒ๋“ค๊ณ , ์ด mask token์œผ๋กœ ์—ฌ๋Ÿฌ mask๋ฅผ ๋งŒ๋“ฌ
  • Auto-Regressive Based Vision-Language Model!!
    • VLM์€ AR-based์™€ non-AR๋กœ ๋‚˜๋‰จ!
    • AR-based (Auto-Regressive base)๋ž€ ๋ฌด์—‡์ธ๊ฐ€!? = Transformer decoder๊ฐ€ next-token prediction์œผ๋กœ ๋™์ž‘ํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์˜๋ฏธ!
      • ๋‹ค์Œ token์„ predict ํ•˜๋Š”, decoding์— ์ค‘์ ์œผ๋กœ ์ž์—ฐ์–ด ํ…์ŠคํŠธ ์ƒ์„ฑ์„ ํ•จ! GPT-4V, Qwen2.5VL๋“ฑ ์šฐ๋ฆฌ๊ฐ€ ์•Œ๊ณ ์žˆ๋Š” VLM์ด ์—ฌ๊ธฐํ•ด๋‹น!
      • image encoder, text tokenizer, projection layers, language decoder๋กœ ๊ตฌ์„ฑ๋จ!
      • ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ๋ฐ›์œผ๋ฉด, ๊ฐ๊ฐ์˜ ์ธ์ฝ”๋”๋ฅผ ํ†ตํ•ด ํ† ํฐ์„ ๋ฐ›๊ณ , projection layers ๋กœ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ์ •๋ ฌํ•˜๊ณ  ๋””์ฝ”๋”๋กœ ๊ฐ€์„œ output์„ ์ƒ์„ฑํ•œ๋‹ค!!
    • ๋ฐ˜๋ฉด non-AR์€ โ€œํ…์ŠคํŠธ ์ƒ์„ฑโ€๋ณด๋‹ค๋Š” ๋ถ„๋ฅ˜/์ •๋ ฌ/๋งค์นญ ์ค‘์‹ฌ, autoregressive LM ๋””์ฝ”๋”๊ฐ€ ํ•ต์‹ฌ์ด ์•„๋‹ˆ๋‹ค!
      • ํ…์ŠคํŠธ ์ƒ์„ฑ์™ธ์˜ ๋ถ„๋ฅ˜(CLIP), Mask ์ƒ์„ฑ(SAM) ๋“ฑ์— ๊ฐ•์ ์„ ๊ฐ€์ง!
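
To summarize the AR-VLM pipeline above in code form, here is a purely schematic sketch of the generation loop; vision_encoder, projector, embed, and decoder are hypothetical toy stand-ins, not any real model's modules or API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in modules, only to make the schematic loop below runnable.
D, V = 32, 100                                   # hidden size, vocab size
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 4 * D), nn.Unflatten(1, (4, D)))
projector = nn.Linear(D, D)                      # aligns vision features to the LLM token space
embed = nn.Embedding(V, D)                       # text token embedding
decoder = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, V))  # toy stand-in for the causal decoder

def ar_vlm_generate(image, prompt_ids, max_new_tokens=8, eos_id=2):
    """Schematic next-token loop of an AR-based VLM (greedy decoding)."""
    vis_tokens = projector(vision_encoder(image))             # (1, T_img, D) aligned image tokens
    seq = torch.cat([vis_tokens, embed(prompt_ids)], dim=1)   # multimodal input sequence (1, T, D)
    out_ids = []
    for _ in range(max_new_tokens):
        logits = decoder(seq)                                  # (1, T, V)
        next_id = logits[:, -1].argmax(dim=-1)                 # predict the next token
        if next_id.item() == eos_id:
            break
        out_ids.append(next_id.item())
        seq = torch.cat([seq, embed(next_id)[:, None]], dim=1) # append and continue
    return out_ids

tokens = ar_vlm_generate(torch.rand(1, 3, 16, 16), torch.tensor([[5, 7, 9]]))
```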

Object Detection

Image

  • Open-set methods: CLIP enabled the move to open-set, but these still require the target objects to be defined in advance → a limitation
    • Open-set methods referred to here: GLIP, GroundingDINO, SWORD, YOLO-World
  • Open-ended methods: predict object names + locations simultaneously, with no predefined categories
    • GenerateU was the first to pose the open-ended problem; DetCLIPv3 and others followed!!
    • But existing open-ended models need large-scale training data / fine-tuning
  • → VL-SAM is the first proposal for training-free open-ended detection + segmentation!

🧱 VL-SAM Architecture

Image

  • VLM and SAM are connected, serving as the object recognition and object localization models, respectively!
  • [object recognition] Given an input image, the VLM detects the objects in the image, and the attention generation module builds the attention map.
  • [object localization] Point prompts are built from the attention map and sent to SAM to produce the segmentation!!

1) [object recognition] - Attention Map Generation (VLM)

  • This is the core idea of VL-SAM: building the object prompts that go into SAM!!
    a. First, ask the VLM to list every object in the image, then obtain the object list from the answer with the Tag2Text technique
    b. While tokens are generated, store the queries (Q) and keys (K) of every layer and every head, and retrieve the Q, K of the extracted objects
    c. For the stored Q, K (there are many: number of layers × number of heads), compute Q × Kᵀ, apply the causal mask, and normalize with SoftMax to obtain the similarity matrix S
    d. Since every layer's and every head's contribution differs, compute a weight W for each
      • Computed as W = Mean(Max(S, dim = 1), dim = 0).
      • Easy example at the very end!
    e. Produce the corrected S′ from the computed weights W and S!
    f. From the per-layer S′, compute the final attention flow!
    g. If S′ were simply multiplied straight through, the AR-VLM's causal mask would cause a collapse (values piling up in the top-left), so a regularized attention flow is used to prevent this (it artificially reduces the attention values concentrated at the top-left)!!
    Finally, the attention map is complete!!!

Equations for steps d and e!! Image

Visualization of step e!! Image

Visualization of step f!! Image

Demonstration of step g's role!! Image

2) [object localization] - SAM Prompt Generation

  • Extract positive & negative points from the attention map built in the previous stage and feed them to SAM
  • However, the attention map is not perfect, so filtering is needed!
    • Positive area: the region above a threshold; the largest value within it becomes the positive point
    • Negative area: the region outside the positive area; the smallest value there becomes the negative point
  • Also, since SAM's output can be noisy, the procedure is run in two passes!
    • First, as in PerSAM, generate multiple segmentation masks from the point pair
    • Second, mask the attention map again with the first-pass result, extract new positive/negative point pairs, and run SAM several more times to refine
    • Finally, aggregate these results with NMS

3) Ensembles - For Accurate Answers!!

  • Small parts of a low-resolution image can be missed!
    • So split it into sub-images and run VL-SAM on each!!
  • Sensitive to the prompt?
    • So ask the VLM for 10 prompts and merge the results of all 10!

🔧 Evaluation Setup (Recipe) and Results!

  • Uses the CogVLM-17B + SAM (ViT-Huge) combination!
    • CogVLM-17B = EVA2-CLIP-E (vision) + Vicuna (language)
  • Zero-shot evaluation on datasets such as LVIS and CODA
  • Since the setting is open-ended, the free-form object names generated by the model (open-vocab) are embedded with the CLIP text encoder and matched against the dataset's fixed labels for evaluation

🎯 LVIS (long-tail segmentation)

Image

  • +3.4 APrare improvement over the existing open-ended method GenerateU
  • Provides segmentation masks at the same time, though performance does not reach Mask R-CNN

🎯 CODA (corner-case detection, autonomous driving)

Image

  • Achieves 40.1 mAR (vs. LLaVA-Grounding: 18.4 mAR)
  • About 74.1% of the Oracle SAM upper bound (54.1 mAR)

🎯 Ablation Study

Image

  • Attention generation, prompt sampling, iterative refinement, and the multi-scale & question ensembles each contribute to the improvement
    • Without regularized attention flow, attention collapse occurs
    • The prompt sampling strategy improves segmentation quality
    • Combining the multi-scale and question ensembles maximizes corner-case detection performance

✅ Conclusion

  • VL-SAM is the first to achieve open-ended object detection and segmentation in a training-free way
  • An innovative design that connects the VLM's attention to SAM's prompts
  • Enables label-free, training-free general-purpose recognition, with broad application potential in autonomous driving, robotics, and safety-critical systems

(Appendix) An Easy Example of Computing W – 🧮 Per-Head Importance Weight W with N=3 (cat, dog, truck)

Equation (1) from the paper:
W = Mean(Max(S, dim = 1), dim = 0)

  • S ∈ ℝ^{N × N × H × L} (N = number of tokens, H = number of heads, L = number of layers)
  • Here we first take one head h in a fixed layer l as the example: S^{h,l} ∈ ℝ^{N×N}

1) Example for a Single Head h in a Single Layer l

Tokens: cat(1), dog(2), truck(3) → N=3
The similarity matrix of this head (think of it as the softmax of Q×Kᵀ):

\[S^{h,l} = \begin{bmatrix} 0.70 & 0.20 & 0.10 & \quad \text{(Query=cat)} \\ 0.10 & 0.60 & 0.30 & \quad \text{(Query=dog)} \\ 0.15 & 0.25 & 0.60 & \quad \text{(Query=truck)} \end{bmatrix}\]

(a) Max(S, dim = 1) ← row-wise maximum over the Key direction (index j)

  • Max of the Query(cat) row: max(0.70, 0.20, 0.10) = 0.70
  • Max of the Query(dog) row: max(0.10, 0.60, 0.30) = 0.60
  • Max of the Query(truck) row: max(0.15, 0.25, 0.60) = 0.60
\[\text{Max}(S^{h,l}, \text{dim}=1) = \begin{bmatrix} 0.70 \\ 0.60 \\ 0.60 \end{bmatrix}\]

(b) Mean(…, dim = 0) ← average over the Query direction (index i)

\(W_{h,l} = \frac{0.70 + 0.60 + 0.60}{3} = \mathbf{0.6333}\)

  • This value W_{h,l} is the "importance (focus) of Head h in Layer l".
  • Intuition: the row-wise max captures how strongly each Query concentrates on one Key, and averaging over all Queries gives the head's representative focus.

2) Now Assume the Same Layer l Has Two Heads (H=2)

Example S^{h2,l} for the second head:

\[S^{h2,l} = \begin{bmatrix} 0.40 & 0.30 & 0.30 \\ 0.35 & 0.35 & 0.30 \\ 0.34 & 0.33 & 0.33 \end{bmatrix}\]
  • Max by row → [0.40, 0.35, 0.34]
  • Mean of those → ( W_{h2,l} = (0.40 + 0.35 + 0.34)/3 = \mathbf{0.3633} )

As a result, within the same layer l:

  • Head 1 importance: (W_{h1,l} = 0.6333)
  • Head 2 importance: (W_{h2,l} = 0.3633)
    → Head 1 is judged the more "useful" head and receives the larger weight.

3) Overall Shape & Broadcasting

  • Performing the computation above for every head and layer gives
    \(W \in \mathbb{R}^{1 \times 1 \times H \times L}\)

  • Next step, Equation (2): \(S' = \text{Mean}(S \odot W, \text{dim}=2)\)

    • (S \odot W): element-wise product over the head dimension (H), via broadcasting
    • Then average over the head axis (dim=2) → the head-weighted (S' \in \mathbb{R}^{N \times N \times L})

4) One-Line Summary

  • Row-wise max (over Keys) = how strongly each Query focuses
  • Mean of those maxima (over Queries) = the head's representative focus (W_{h,l})
  • Weight by W, then average over heads → an S′ that better preserves the information from good heads
This post is licensed under CC BY 4.0 by the author.