🔎 VL-SAM: Training-Free Open-Ended Object Detection & Segmentation

Image

  • Title: Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
  • Conference: NeurIPS 2024
  • Keywords: Open-Ended Detection, Segmentation, SAM, VLM, Attention Map, Training-Free
  • Summary: By linking the attention map of a Vision-Language Model (VLM) to the prompt of Segment Anything Model (SAM), VL-SAM enables simultaneous detection and segmentation of unseen objects without additional training!

🚀 VL-SAM Key Summary

One-liner: "No labels, no training; just use VLM attention as prompts for SAM to detect and segment objects!"

Image

1) Training-Free Open-Ended Framework

  • Combines VLM + SAM without extra training
  • Uses the attention map as prompts (going beyond open-set: unlike open-set methods, it does not need the category words at all!)

2) Attention Map Aggregation & Flow

  • Aggregates multi-head, multi-layer attention from VLM for high-quality maps
  • Mitigates collapse due to causal mask with regularized attention flow

3) Prompt Generation & Iterative Refinement

  • Samples positive/negative points from attention
  • Feeds into SAM, then iteratively refines with feedback

4) Generalization & Modularity

  • Applicable to various VLMs (MiniGPT-4, LLaVA, etc.) and SAM variants (MobileSAM, etc.)

๐Ÿ” Background

📑 Vision-Language Models (VLMs)

  • Beyond GPT-3, LLaMA → emergence of VLMs:
    • BLIP-2: Q-Former module / aligns image-text embeddings with multiple pretraining losses
    • LLaMA-Adapter / LLaVA / MiniGPT: adapters or projection layers / aligns vision features into LLM space / combines large LLM with vision
    • CogVLM: introduces Visual Expert Modules / converts image features at transformer-head level
    • SPHINX: supports multi-vision tasks with various mixing techniques
    • CogAgent / LLaVA-Phi: defines VLMs as agents / supports multi-step reasoning / interactive and tool-use tasks
    • GPT-4V: strong generalization, handles corner cases and complex real-world scenarios
  • BUT: Localization ability is still weaker than models like SAM.
  • → Hence, the idea is to combine VLM and SAM in a training-free way to provide segmentation capability!

  • Extra note (Preliminaries):
    • Segment Anything Model (SAM): bbox/point prompt-based segmentation model, composed of image encoder, prompt encoder, mask decoder.
    • AR-based VLMs: Auto-Regressive, i.e., Transformer decoders with next-token prediction (e.g., GPT-4V, Qwen2.5VL).
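
As a quick reference for the interface VL-SAM relies on, here is a minimal sketch of prompting SAM with positive/negative points using Meta's segment-anything package (the checkpoint path and the dummy image are placeholders):

```python
# Minimal sketch: prompting SAM with point prompts (segment-anything package).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # replace with a real RGB image
predictor.set_image(image)

point_coords = np.array([[256, 200], [40, 40]])   # (x, y) pixel coordinates
point_labels = np.array([1, 0])                   # 1 = positive point, 0 = negative point

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,       # SAM returns several candidate masks
)
best_mask = masks[np.argmax(scores)]
```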

Object Detection Paradigms

Image

  • Open-Set: With CLIP, but still requires predefined categories (e.g., GLIP, GroundingDINO, SWORD, YOLO-World).
  • Open-Ended: Predict both object name + location with no predefined categories. GenerateU pioneered this, followed by DetCLIPv3. But requires large training data.
  • → VL-SAM: first training-free open-ended detection + segmentation.

🧱 VL-SAM Architecture

Image

  • Connects VLM (object recognition) with SAM (object localization).
  • [object recognition] Input image → VLM detects objects and produces attention map.
  • [object localization] Attention map → converted into point prompts for SAM → segmentation masks.

1) [object recognition] - Attention Map Generation (VLM)

  • Core idea: build SAM prompts directly from VLM attention!
    a. Ask VLM to list all objects (Tag2Text-like object list extraction)
    b. Save Q/K for all layers & heads during token generation
    c. Compute (Q \times K^T), apply the causal mask, Softmax → similarity matrix S
    d. Compute head-layer weights (W = \text{Mean}(\max(S, \text{dim}=1), \text{dim}=0))
    e. Apply W to S to get the corrected (S')
    f. Aggregate across layers → final attention flow
    g. Regularize attention flow (to prevent collapse in AR-based VLMs)

Finally: attention map is ready!
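
To make steps (c)–(f) concrete, here is a minimal NumPy sketch, under the assumption that the per-head, per-layer similarity matrices have already been collected into one tensor S; the layer-wise flow is chained with plain matrix products, and the regularization of step (g) is only noted in a comment rather than implemented.

```python
import numpy as np

def aggregate_attention(S: np.ndarray) -> np.ndarray:
    """Aggregate multi-head, multi-layer attention into a single map.

    S: similarity tensor of shape (N, N, H, L), i.e. the softmaxed Q·K^T of
       every head and layer collected in step (c). Returns an (N, N) map.
    """
    # Step (d): per-head/per-layer weight W = Mean(Max(S, dim=1), dim=0), shape (H, L).
    W = S.max(axis=1).mean(axis=0)

    # Step (e): weight S by W and average over the head axis -> S' of shape (N, N, L).
    S_prime = (S * W[None, None, :, :]).mean(axis=2)

    # Step (f): chain the per-layer maps into an attention "flow".
    # (Simplified: the paper additionally regularizes this product, step (g),
    #  to avoid the causal-mask collapse toward the top-left corner.)
    flow = S_prime[:, :, 0]
    for layer in range(1, S_prime.shape[-1]):
        flow = S_prime[:, :, layer] @ flow
    return flow

# Shape-level usage: 8 tokens, 4 heads, 2 layers of (already softmaxed) attention.
S = np.random.rand(8, 8, 4, 2)
S /= S.sum(axis=1, keepdims=True)       # make each query row sum to 1
attn_map = aggregate_attention(S)       # (8, 8)
```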


2) [object localization] - SAM Prompt Generation

  • Extract positive/negative points from attention map
    • Positive: top values above threshold
    • Negative: low values outside positive regions
  • Run SAM once, then refine by re-sampling points from the first result
  • Aggregate masks with NMS
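
A minimal sketch of this point-sampling step, assuming a single normalized 2-D attention map per object; the fixed threshold and the single positive/negative pair are simplifications of the paper's sampling strategy:

```python
import numpy as np

def sample_prompt_points(attn: np.ndarray, pos_thresh: float = 0.5):
    """Pick one positive and one negative point prompt from an attention map.

    attn: (H, W) attention map scaled to [0, 1].
    Returns ((x, y) positive, (x, y) negative) in (column, row) order for SAM.
    """
    pos_area = attn >= pos_thresh
    # Positive point: the maximum attention value inside the positive region.
    pos_idx = np.unravel_index(np.argmax(np.where(pos_area, attn, -np.inf)), attn.shape)
    # Negative point: the minimum attention value outside the positive region.
    neg_idx = np.unravel_index(np.argmin(np.where(pos_area, np.inf, attn)), attn.shape)
    return (pos_idx[1], pos_idx[0]), (neg_idx[1], neg_idx[0])

attn = np.random.rand(64, 64)
attn /= attn.max()                      # normalize so the peak is 1.0
pos_pt, neg_pt = sample_prompt_points(attn)
```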

3) Ensembles for Accuracy

  • Sub-image tiling for small objects
  • Multiple prompts: ask VLM multiple times and ensemble results
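
A small sketch of the sub-image ensemble idea: tile the image into overlapping crops, run the same pipeline on each crop, and shift the per-crop detections back to full-image coordinates before the final NMS (the 2×2 grid and 10% overlap are illustrative choices, not values from the paper):

```python
import numpy as np

def tile_image(image: np.ndarray, rows: int = 2, cols: int = 2, overlap: float = 0.1):
    """Yield (crop, x_offset, y_offset) overlapping tiles covering the image."""
    H, W = image.shape[:2]
    th, tw = H // rows, W // cols
    dy, dx = int(th * overlap), int(tw * overlap)
    for r in range(rows):
        for c in range(cols):
            y0, x0 = max(r * th - dy, 0), max(c * tw - dx, 0)
            y1, x1 = min((r + 1) * th + dy, H), min((c + 1) * tw + dx, W)
            yield image[y0:y1, x0:x1], x0, y0

# Each tile is run through the VLM + SAM pipeline; per-tile boxes/masks are then
# shifted by (x_offset, y_offset) back to full-image coordinates and merged with NMS.
```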

🔧 Evaluation & Results

  • CogVLM-17B + SAM (ViT-H)
  • Zero-shot evaluation on LVIS, CODA
  • For open-ended eval: embed generated object names with CLIP and match with dataset labels
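
A sketch of the label-matching step used for open-ended evaluation, assuming the generated names and the dataset category names have already been embedded with the CLIP text encoder; each generated name is assigned the dataset label with the highest cosine similarity:

```python
import numpy as np

def match_open_ended_names(gen_emb: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """Map generated object names to dataset labels via CLIP text embeddings.

    gen_emb:   (G, D) CLIP text embeddings of the generated (free-form) names.
    label_emb: (C, D) CLIP text embeddings of the dataset's category names.
    Returns, for each generated name, the index of its closest dataset label.
    """
    gen_emb = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    label_emb = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    sim = gen_emb @ label_emb.T          # cosine similarity, shape (G, C)
    return sim.argmax(axis=1)
```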

🎯 LVIS

Image

  • +3.4 APrare over GenerateU
  • Produces segmentation masks simultaneously, though not as strong as Mask R-CNN

🎯 CODA (corner-case detection)

Image

  • Achieves 40.1 mAR (vs 18.4 mAR for LLaVA-Grounding)
  • 74.1% of Oracle SAM upper bound (54.1 mAR)

🎯 Ablation Study

Image

  • Each component (attention generation, prompt sampling, iterative refinement, multi-scale/question ensemble) contributes
  • Without regularized attention flow → collapse
  • Prompt sampling strategy improves segmentation quality
  • Multi-scale + question ensemble maximizes corner-case detection

✅ Conclusion

  • VL-SAM: first training-free open-ended detection + segmentation framework
  • Innovative design: connect VLM attention → SAM prompts
  • Enables label-free, training-free recognition, with potential applications in autonomous driving, robotics, safety-critical systems

(Appendix) Easy Example of Computing W – 🧮 N=3 (cat, dog, truck)

Equation (1):
( W = \text{Mean}(\max(S, \text{dim}=1), \text{dim}=0) )

  • ( S \in \mathbb{R}^{N \times N \times H \times L} )
  • Here, we show a single head h in a single layer l: ( S^{h,l} \in \mathbb{R}^{N \times N} )

1) Single Head h, Single Layer l Example

Tokens: cat(1), dog(2), truck(3) → N=3
Similarity matrix (S^{h,l}):

\[S^{h,l} = \begin{bmatrix} 0.70 & 0.20 & 0.10 & \quad \text{(Query=cat)} \\ 0.10 & 0.60 & 0.30 & \quad \text{(Query=dog)} \\ 0.15 & 0.25 & 0.60 & \quad \text{(Query=truck)} \end{bmatrix}\]

(a) Max(S, dim=1)

  • cat row: 0.70
  • dog row: 0.60
  • truck row: 0.60
\[\max(S^{h,l}, \text{dim}=1) = \begin{bmatrix} 0.70 \\ 0.60 \\ 0.60 \end{bmatrix}\]

(b) Mean of those values

\(W_{h,l} = \frac{0.70 + 0.60 + 0.60}{3} = \mathbf{0.6333}\)


2) Same Layer l with Head 2 (H=2)

Example ( S^{h2,l} ):

\[S^{h2,l} = \begin{bmatrix} 0.40 & 0.30 & 0.30 \\ 0.35 & 0.35 & 0.30 \\ 0.34 & 0.33 & 0.33 \end{bmatrix}\]
  • Max by row → [0.40, 0.35, 0.34]
  • Mean → ( W_{h2,l} = 0.3633 )

So within layer l:

  • Head 1: (W_{h1,l} = 0.6333)
  • Head 2: (W_{h2,l} = 0.3633)
    → Head 1 is more "useful" and weighted higher.

3) Overall Shape & Broadcasting

  • Across all heads/layers:
    \(W \in \mathbb{R}^{1 \times 1 \times H \times L}\)

  • Next step (Equation 2):
    \(S' = \text{Mean}(S \odot W, \text{dim}=2)\)


4) Takeaway

  • Row-wise max (over Keys) = how strongly each Query focuses
  • Mean over Queries = head's overall importance (W_{h,l})
  • Apply W then average over heads = emphasizes good heads, yielding higher-quality (S')
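
For completeness, the worked example above can be reproduced mechanically; a small NumPy snippet (matching the shapes used in this appendix) for Equations (1) and (2):

```python
import numpy as np

# The two heads of one layer from the example above: S has shape (N, N, H, L) = (3, 3, 2, 1).
S1 = np.array([[0.70, 0.20, 0.10],
               [0.10, 0.60, 0.30],
               [0.15, 0.25, 0.60]])
S2 = np.array([[0.40, 0.30, 0.30],
               [0.35, 0.35, 0.30],
               [0.34, 0.33, 0.33]])
S = np.stack([S1, S2], axis=-1)[..., None]

# Equation (1): W = Mean(Max(S, dim=1), dim=0) -> broadcastable shape (1, 1, H, L)
W = S.max(axis=1).mean(axis=0)[None, None]
print(W.ravel())         # [0.63333333 0.36333333]  (Head 1, Head 2)

# Equation (2): S' = Mean(S * W, dim=2) -> head-weighted map of shape (N, N, L)
S_prime = (S * W).mean(axis=2)
print(S_prime.shape)     # (3, 3, 1)
```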

🔎 (Korean version) VL-SAM: Open-Ended Object Detection and Segmentation, Without Any Training!

Image

  • Title: Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
  • Conference: NeurIPS 2024
  • Key keywords: Open-Ended Detection, Segmentation, SAM, VLM, Attention Map, Training-Free
  • Summary: By connecting the attention map of a Vision-Language Model (VLM) to the prompt of the Segment Anything Model (SAM), VL-SAM detects and segments unseen objects simultaneously, with no additional training!

🚀 VL-SAM Key Summary

One-line summary: "No labels, no training needed; use the VLM's attention to call SAM and find the objects!"

Image

1) Training-Free Open-Ended Framework

  • Combines VLM + SAM with no extra training
  • Uses the attention map to generate prompts!! (a step beyond open-set: open-set methods must be given the category words, but this does not even need those!)

2) Attention Map Aggregation & Flow

  • Gathers the VLM's multi-head, multi-layer attention to build a high-quality map
  • Mitigates the collapse caused by the causal mask with a regularized attention flow

3) Prompt Generation & Iterative Refinement

  • Samples positive/negative points from the attention
  • Feeds them into SAM, then feeds the results back to progressively improve performance

4) Generalization & Modularity

  • Applicable to various VLMs (MiniGPT-4, LLaVA, etc.) and SAM variants (MobileSAM, etc.)

๐Ÿ” ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ํ๋ฆ„

📑 VLM - Vision-Language Model

  • Going beyond LLMs such as GPT-3 and LLaMA, Vision-Language Models have appeared!!
    • BLIP-2: introduces the Q-Former module / connects and fuses image and text embeddings / uses three alignment pretraining losses
    • LLaMA-Adapter / LLaVA / MiniGPT: use adapters (or projection layers) / align image features into the text embedding space / combine a large LLM with the vision modality
    • CogVLM: introduces Visual Expert Modules / converts and aligns image features at the transformer-head level / provides fine-grained per-head mapping
    • SPHINX: supports multiple vision tasks / uses various mixing techniques / applicable across diverse tasks
    • CogAgent / LLaVA-Phi: define the VLM as an agent (assistant) / perform multi-step reasoning / handle interactive and tool-use tasks
    • GPT-4V: strong generalization / can understand and reason about new or rare situations (corner cases) / handles complex real-world scenarios such as autonomous driving
  • BUT!! Its localization ability still falls short of purpose-built models like SAM!
  • So!! The idea is to connect the two models training-free and demonstrate segmentation capability!

    • ๋” ์•Œ์•„๋‘˜ ์‚ฌํ•ญ!!(Preliminary)
  • Segment Anything Model!!?
    • SAM์ด ๋Œ€ํ‘œ์ ์ด๋ฉฐ SAM์ด ์•„์ด๋””์–ด๋ฅผ ์–ป์€ MaskDINO๋„ ์žˆ์Œ! ใ…‡
    • bbox, point prompt ๊ธฐ๋ฐ˜์˜ Segmentation ๋ชจ๋ธ!!
    • 3๊ฐœ์˜ ์ฃผ์š” ์š”์†Œ๋กœ ๊ตฌ์„ฑ๋จ : ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”, ํ”„๋กฌํฌํŠธ ์ธ์ฝ”๋”, ๋งˆ์Šคํฌ ๋””์ฝ”๋”
    • ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”ฉ & ํ”„๋กฌํฌํŠธ ํ† ํฐ & ์ดˆ๊ธฐํ™” ๋งˆ์Šคํฌ ํ† ํฐ์„ ๋งˆ์Šคํฌ ๋””์ฝ”๋”์— ๋„ฃ์–ด์„œ!! ์ตœ์ ์˜ mask token์„ ๋งŒ๋“ค๊ณ , ์ด mask token์œผ๋กœ ์—ฌ๋Ÿฌ mask๋ฅผ ๋งŒ๋“ฌ
  • Auto-Regressive Based Vision-Language Model!!
    • VLM์€ AR-based์™€ non-AR๋กœ ๋‚˜๋‰จ!
    • AR-based (Auto-Regressive base)๋ž€ ๋ฌด์—‡์ธ๊ฐ€!? = Transformer decoder๊ฐ€ next-token prediction์œผ๋กœ ๋™์ž‘ํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์˜๋ฏธ!
      • ๋‹ค์Œ token์„ predict ํ•˜๋Š”, decoding์— ์ค‘์ ์œผ๋กœ ์ž์—ฐ์–ด ํ…์ŠคํŠธ ์ƒ์„ฑ์„ ํ•จ! GPT-4V, Qwen2.5VL๋“ฑ ์šฐ๋ฆฌ๊ฐ€ ์•Œ๊ณ ์žˆ๋Š” VLM์ด ์—ฌ๊ธฐํ•ด๋‹น!
      • image encoder, text tokenizer, projection layers, language decoder๋กœ ๊ตฌ์„ฑ๋จ!
      • ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ๋ฐ›์œผ๋ฉด, ๊ฐ๊ฐ์˜ ์ธ์ฝ”๋”๋ฅผ ํ†ตํ•ด ํ† ํฐ์„ ๋ฐ›๊ณ , projection layers ๋กœ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ์ •๋ ฌํ•˜๊ณ  ๋””์ฝ”๋”๋กœ ๊ฐ€์„œ output์„ ์ƒ์„ฑํ•œ๋‹ค!!
    • ๋ฐ˜๋ฉด non-AR์€ โ€œํ…์ŠคํŠธ ์ƒ์„ฑโ€๋ณด๋‹ค๋Š” ๋ถ„๋ฅ˜/์ •๋ ฌ/๋งค์นญ ์ค‘์‹ฌ, autoregressive LM ๋””์ฝ”๋”๊ฐ€ ํ•ต์‹ฌ์ด ์•„๋‹ˆ๋‹ค!
      • ํ…์ŠคํŠธ ์ƒ์„ฑ์™ธ์˜ ๋ถ„๋ฅ˜(CLIP), Mask ์ƒ์„ฑ(SAM) ๋“ฑ์— ๊ฐ•์ ์„ ๊ฐ€์ง!
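
To summarize the AR-VLM pipeline above in code form, here is a purely schematic sketch of the generation loop; vision_encoder, projector, embed, and decoder are hypothetical toy stand-ins, not any real model's modules or API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in modules, only to make the schematic loop below runnable.
D, V = 32, 100                                   # hidden size, vocab size
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 4 * D), nn.Unflatten(1, (4, D)))
projector = nn.Linear(D, D)                      # aligns vision features to the LLM token space
embed = nn.Embedding(V, D)                       # text token embedding
decoder = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, V))  # toy stand-in for the causal decoder

def ar_vlm_generate(image, prompt_ids, max_new_tokens=8, eos_id=2):
    """Schematic next-token loop of an AR-based VLM (greedy decoding)."""
    vis_tokens = projector(vision_encoder(image))             # (1, T_img, D) aligned image tokens
    seq = torch.cat([vis_tokens, embed(prompt_ids)], dim=1)   # multimodal input sequence (1, T, D)
    out_ids = []
    for _ in range(max_new_tokens):
        logits = decoder(seq)                                  # (1, T, V)
        next_id = logits[:, -1].argmax(dim=-1)                 # predict the next token
        if next_id.item() == eos_id:
            break
        out_ids.append(next_id.item())
        seq = torch.cat([seq, embed(next_id)[:, None]], dim=1) # append and continue
    return out_ids

tokens = ar_vlm_generate(torch.rand(1, 3, 16, 16), torch.tensor([[5, 7, 9]]))
```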

Object Detection

Image

  • Open-set methods: CLIP enabled the move to open-set, but these still require the target objects to be defined in advance → a limitation
    • Open-set methods referred to here: GLIP, GroundingDINO, SWORD, YOLO-World
  • Open-ended methods: predict object names + locations simultaneously, with no predefined categories
    • GenerateU was the first to pose the open-ended problem; DetCLIPv3 and others followed!!
    • But existing open-ended models need large-scale training data / fine-tuning
  • → VL-SAM is the first proposal for training-free open-ended detection + segmentation!

🧱 VL-SAM Architecture

Image

  • VLM and SAM are connected, serving as the object recognition and object localization models, respectively!
  • [object recognition] Given an input image, the VLM detects the objects in the image, and the attention generation module builds the attention map.
  • [object localization] Point prompts are built from the attention map and sent to SAM to produce the segmentation!!

1) [object recognition] - Attention Map Generation (VLM)

  • This is the core idea of VL-SAM: building the object prompts that go into SAM!!
    a. First, ask the VLM to list every object in the image, then obtain the object list from the answer with the Tag2Text technique
    b. While tokens are generated, store the queries (Q) and keys (K) of every layer and every head, and retrieve the Q, K of the extracted objects
    c. For the stored Q, K (there are many: number of layers × number of heads), compute Q × Kᵀ, apply the causal mask, and normalize with SoftMax to obtain the similarity matrix S
    d. Since every layer's and every head's contribution differs, compute a weight W for each
      • Computed as W = Mean(Max(S, dim = 1), dim = 0).
      • Easy example at the very end!
    e. Produce the corrected S′ from the computed weights W and S!
    f. From the per-layer S′, compute the final attention flow!
    g. If S′ were simply multiplied straight through, the AR-VLM's causal mask would cause a collapse (values piling up in the top-left), so a regularized attention flow is used to prevent this (it artificially reduces the attention values concentrated at the top-left)!!
    Finally, the attention map is complete!!!

Equations for steps d and e!! Image

Visualization of step e!! Image

Visualization of step f!! Image

Demonstration of step g's role!! Image

2) [object localization] - SAM Prompt Generation

  • Extract positive & negative points from the attention map built in the previous stage and feed them to SAM
  • However, the attention map is not perfect, so filtering is needed!
    • Positive area: the region above a threshold; the largest value within it becomes the positive point
    • Negative area: the region outside the positive area; the smallest value there becomes the negative point
  • Also, since SAM's output can be noisy, the procedure is run in two passes!
    • First, as in PerSAM, generate multiple segmentation masks from the point pair
    • Second, mask the attention map again with the first-pass result, extract new positive/negative point pairs, and run SAM several more times to refine
    • Finally, aggregate these results with NMS

3) Ensembles - For Accurate Answers!!

  • Small parts of a low-resolution image can be missed!
    • So split it into sub-images and run VL-SAM on each!!
  • Sensitive to the prompt?
    • So ask the VLM for 10 prompts and merge the results of all 10!

🔧 Evaluation Setup (Recipe) and Results!

  • Uses the CogVLM-17B + SAM (ViT-Huge) combination!
    • CogVLM-17B = EVA2-CLIP-E (vision) + Vicuna (language)
  • Zero-shot evaluation on datasets such as LVIS and CODA
  • Since the setting is open-ended, the free-form object names generated by the model (open-vocab) are embedded with the CLIP text encoder and matched against the dataset's fixed labels for evaluation

🎯 LVIS (long-tail segmentation)

Image

  • +3.4 APrare improvement over the existing open-ended method GenerateU
  • Provides segmentation masks at the same time, though performance does not reach Mask R-CNN

🎯 CODA (corner-case detection, autonomous driving)

Image

  • Achieves 40.1 mAR (vs. LLaVA-Grounding: 18.4 mAR)
  • About 74.1% of the Oracle SAM upper bound (54.1 mAR)

🎯 Ablation Study

Image

  • Attention generation, prompt sampling, iterative refinement, and the multi-scale & question ensembles each contribute to the improvement
    • Without regularized attention flow, attention collapse occurs
    • The prompt sampling strategy improves segmentation quality
    • Combining the multi-scale and question ensembles maximizes corner-case detection performance

✅ Conclusion

  • VL-SAM is the first to achieve open-ended object detection and segmentation in a training-free way
  • An innovative design that connects the VLM's attention to SAM's prompts
  • Enables label-free, training-free general-purpose recognition, with broad application potential in autonomous driving, robotics, and safety-critical systems

(Appendix) An Easy Example of Computing W – 🧮 Per-Head Importance Weight W with N=3 (cat, dog, truck)

Equation (1) from the paper:
W = Mean(Max(S, dim = 1), dim = 0)

  • S ∈ ℝ^{N × N × H × L} (N = number of tokens, H = number of heads, L = number of layers)
  • Here we first take one head h in a fixed layer l as the example: S^{h,l} ∈ ℝ^{N×N}

1) Example for a Single Head h in a Single Layer l

Tokens: cat(1), dog(2), truck(3) → N=3
The similarity matrix of this head (think of it as the softmax of Q×Kᵀ):

\[S^{h,l} = \begin{bmatrix} 0.70 & 0.20 & 0.10 & \quad \text{(Query=cat)} \\ 0.10 & 0.60 & 0.30 & \quad \text{(Query=dog)} \\ 0.15 & 0.25 & 0.60 & \quad \text{(Query=truck)} \end{bmatrix}\]

(a) Max(S, dim = 1) ← row-wise maximum over the Key direction (index j)

  • Max of the Query(cat) row: max(0.70, 0.20, 0.10) = 0.70
  • Max of the Query(dog) row: max(0.10, 0.60, 0.30) = 0.60
  • Max of the Query(truck) row: max(0.15, 0.25, 0.60) = 0.60
\[\text{Max}(S^{h,l}, \text{dim}=1) = \begin{bmatrix} 0.70 \\ 0.60 \\ 0.60 \end{bmatrix}\]

(b) Mean(…, dim = 0) ← average over the Query direction (index i)

\(W_{h,l} = \frac{0.70 + 0.60 + 0.60}{3} = \mathbf{0.6333}\)

  • This value W_{h,l} is the "importance (focus) of Head h in Layer l".
  • Intuition: the row-wise max captures how strongly each Query concentrates on one Key, and averaging over all Queries gives the head's representative focus.

2) Now Assume the Same Layer l Has Two Heads (H=2)

Example S^{h2,l} for the second head:

\[S^{h2,l} = \begin{bmatrix} 0.40 & 0.30 & 0.30 \\ 0.35 & 0.35 & 0.30 \\ 0.34 & 0.33 & 0.33 \end{bmatrix}\]
  • Max by row → [0.40, 0.35, 0.34]
  • Mean of those → ( W_{h2,l} = (0.40 + 0.35 + 0.34)/3 = \mathbf{0.3633} )

As a result, within the same layer l:

  • Head 1 importance: (W_{h1,l} = 0.6333)
  • Head 2 importance: (W_{h2,l} = 0.3633)
    → Head 1 is judged the more "useful" head and receives the larger weight.

3) Overall Shape & Broadcasting

  • Performing the computation above for every head and layer gives
    \(W \in \mathbb{R}^{1 \times 1 \times H \times L}\)

  • Next step, Equation (2): \(S' = \text{Mean}(S \odot W, \text{dim}=2)\)

    • (S \odot W): element-wise product over the head dimension (H), via broadcasting
    • Then average over the head axis (dim=2) → the head-weighted (S' \in \mathbb{R}^{N \times N \times L})

4) One-Line Summary

  • Row-wise max (over Keys) = how strongly each Query focuses
  • Mean of those maxima (over Queries) = the head's representative focus (W_{h,l})
  • Weight by W, then average over heads → an S′ that better preserves the information from good heads
This post is licensed under CC BY 4.0 by the author.