
🔎 Open-Vocabulary SAM: Expanding to Recognize & Segment 20,000 Classes!

Image

  • Title: Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
  • Venue: ECCV 2024
  • Code/checkpoints: GitHub – OVSAM
  • Key keywords: Segment Anything, Open-Vocabulary, CLIP, Recognition, Promptable Segmentation, CLIP2SAM, SAM2CLIP
  • Summary: Grafts open-vocabulary recognition onto SAM's segmentation ability → scalability to the 20,000-class level!

🚀 Key Highlights of Open-Vocabulary SAM

One-liner: "SAM doesn't just cut out objects anymore; it also names them!"

1) Basic Structure: CLIP2SAM & SAM2CLIP

  • (Segmentation) Encode the image with CLIP → align to SAM via CLIP2SAM → segmentation results from the SAM Decoder
  • (Recognition) Segmentation results → SAM2CLIP → the object's name is retrieved

2) Open-Vocabulary 🎯

  • Recognizes 20,000+ classes without pre-defined labels.
  • With a text prompt (e.g., "cat", "chair"), the segmented mask is matched to the correct label.
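
To make the matching concrete, here is a minimal sketch (not the paper's code) of how a mask embedding could be matched against CLIP text embeddings of arbitrary class names; both tensors below are random placeholders standing in for the real CLIP outputs.

```python
import torch
import torch.nn.functional as F

# Placeholders: in the real model these come from the CLIP text encoder
# (one embedding per class name) and from the recognition head's mask query.
num_classes, dim = 20_000, 512
text_embeds = F.normalize(torch.randn(num_classes, dim), dim=-1)  # CLIP text embeddings
mask_embed = F.normalize(torch.randn(dim), dim=-1)                # embedding of one segmented mask

# Open-vocabulary recognition = cosine similarity against every class prompt.
scores = text_embeds @ mask_embed        # (20000,) similarity scores
label_id = int(scores.argmax())          # index of the best-matching class name
```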

3) Interactive Extensibility 🛠️

  • Retains SAM's point/box/everything prompts.
  • Users can provide prompts (words/sentences) for real-time recognition + segmentation.
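
For reference, a minimal example of the point/box prompt interface from Meta's original segment-anything package, which Open-Vocabulary SAM keeps; the checkpoint path and dummy image are placeholders, and this is plain SAM, not the OVSAM codebase.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint path; any official SAM ViT-H checkpoint works here.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)

# Point prompt: one foreground click at (x, y) = (320, 240).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Box prompt: (x0, y0, x1, y1) around the object of interest.
masks, scores, _ = predictor.predict(box=np.array([100, 100, 400, 380]))
```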

4) General Vision Pipeline ⚡

  • Evolving from a pure segmentation tool to an open segmentation system with recognition.
  • In research and industry, obtain both "what + where" instantly with a single click.

🔍 The Flow of Prior Research

  • VLMs have advanced! Especially CLIP, with contrastive vision-language pre-training → strong zero-shot recognition.
  • Many studies in open-vocabulary detection/segmentation using CLIP.
  • Prompting has evolved: from NLP into vision with point/bbox prompts.
  • SAM: a large-scale model for segmentation, widely applied to tracking, generation, etc.
  • This work uniquely fuses CLIP and SAM!

🧱 Open-Vocabulary SAM Architecture


Baseline (a): Image Cropping Baseline

Image

  • Crop the image according to the SAM mask → feed it to CLIP → retrieve the label
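
A rough sketch of this image-cropping baseline under stated assumptions: the image, mask, "CLIP" image encoder, and text embeddings below are all toy stand-ins, so only the crop-then-classify flow is real.

```python
import torch
import torch.nn.functional as F

def crop_by_mask(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Zero out the background and crop the mask's bounding box (CHW image, HW bool mask)."""
    ys, xs = torch.where(mask)
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    return (image * mask)[..., y0:y1, x0:x1]

# Toy inputs and a random projection standing in for CLIP's image encoder.
image = torch.rand(3, 480, 640)
mask = torch.zeros(480, 640, dtype=torch.bool)
mask[100:300, 200:400] = True
text_embeds = F.normalize(torch.randn(20_000, 512), dim=-1)         # CLIP text embeddings (placeholder)
clip_image_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 512))

crop = crop_by_mask(image, mask)
crop = F.interpolate(crop[None], size=(224, 224), mode="bilinear")  # resize to CLIP's input size
region_embed = F.normalize(clip_image_encoder(crop), dim=-1)        # "CLIP" embedding of the cropped region
label_id = int((region_embed @ text_embeds.T).argmax())             # retrieve the label
```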

Baseline (b): Feature Cropping Baseline

Image

  • From the CLIP embeddings, crop the region corresponding to the SAM mask → retrieve the label
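
And a matching sketch of the feature-cropping baseline: instead of cropping pixels, mask-pool the CLIP feature map. Again, the feature map and text embeddings are random placeholders for the real CLIP outputs.

```python
import torch
import torch.nn.functional as F

# Placeholders standing in for real CLIP outputs.
clip_feat = torch.randn(512, 14, 14)                                 # CLIP image feature map (C, H', W')
text_embeds = F.normalize(torch.randn(20_000, 512), dim=-1)          # CLIP text embeddings

# Downsample the SAM mask to the feature-map resolution.
mask = torch.zeros(480, 640)
mask[100:300, 200:400] = 1.0
mask_small = F.interpolate(mask[None, None], size=clip_feat.shape[1:], mode="nearest")[0, 0]

# Mask pooling: average the CLIP features inside the mask region only.
region_embed = (clip_feat * mask_small).sum(dim=(1, 2)) / mask_small.sum().clamp(min=1)
region_embed = F.normalize(region_embed, dim=-1)

label_id = int((text_embeds @ region_embed).argmax())                # retrieve the label
```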

  • Problems of these baselines:
    1) Using two separate backbones → high computational cost
    2) Different training paradigms (SAM: supervised, CLIP: contrastive) → unstable knowledge transfer
    3) Even with adapters, small-object recognition remains weak
    4) No prior exploration of how to fuse SAM's dense visual features with CLIP's semantic features for open-vocabulary segmentation

OVSAM: Unified Architecture

1) Two backbones = costly
  • Solution: Use the CLIP encoder only; keep SAM's prompt encoder + mask decoder
2) Different training paradigms (SAM: supervised vs. CLIP: contrastive) = unstable knowledge transfer
  • Solution: Introduce SAM2CLIP to bridge the two feature spaces
3) Weak small-object recognition
  • Solution: Enhance CLIP2SAM with an FPN + an R-CNN-like MLP head
4) No SAM–CLIP integration for open vocabulary
  • Solution: This work proposes a unified architecture with open-vocabulary capability!

Image

  • Image Encoder (CLIP + CLIP2SAM)
    • Uses CLIP's visual encoder; the features are projected through CLIP2SAM for alignment with the SAM Decoder
  • Prompt Encoder (SAM)
    • Same as the original SAM: handles point/box/mask prompts
  • Mask Decoder (SAM)
    • Combines CLIP2SAM features + prompt embeddings → outputs segmentation masks
  • Recognition Head (SAM2CLIP)
    • Projects the mask features into the CLIP embedding space
    • Matches them with text embeddings via cosine similarity
    • → Final output = segmentation + labeling
  • CLIP2SAM = the "recognition → segmentation" bridge
  • SAM2CLIP = the "segmentation → recognition" bridge
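
Putting the four blocks together, a toy PyTorch sketch of how the components could be wired; every module here is a random-projection stand-in chosen for readability, not the paper's actual layers, and the shapes are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyOVSAM(nn.Module):
    """Toy wiring of CLIP encoder -> CLIP2SAM -> SAM decoder, plus a SAM2CLIP-style recognition head."""
    def __init__(self, dim=256, clip_dim=512):
        super().__init__()
        self.clip_encoder = nn.Conv2d(3, clip_dim, 16, stride=16)   # stand-in for the frozen CLIP ViT
        self.clip2sam = nn.Conv2d(clip_dim, dim, 1)                 # aligns CLIP features to the SAM decoder
        self.prompt_encoder = nn.Linear(4, dim)                     # stand-in: embeds a box prompt (x0, y0, x1, y1)
        self.mask_decoder = nn.Conv2d(dim, 1, 1)                    # stand-in for SAM's mask decoder
        self.sam2clip_head = nn.Linear(dim, clip_dim)               # recognition head into CLIP text space

    def forward(self, image, box, text_embeds):
        feat = self.clip2sam(self.clip_encoder(image))              # (B, dim, H', W')
        prompt = self.prompt_encoder(box)                           # (B, dim)
        cond = feat + prompt[:, :, None, None]                      # fuse the prompt with image features
        mask_logits = self.mask_decoder(cond)                       # (B, 1, H', W') segmentation mask
        query = cond.mean(dim=(2, 3))                               # (B, dim) pooled mask query
        label_logits = F.normalize(self.sam2clip_head(query), dim=-1) @ text_embeds.T
        return mask_logits, label_logits                            # "where" + "what"

model = ToyOVSAM()
image = torch.rand(1, 3, 224, 224)
box = torch.tensor([[50., 50., 180., 200.]])
text_embeds = F.normalize(torch.randn(20_000, 512), dim=-1)         # CLIP text embeddings (placeholder)
masks, labels = model(image, box, text_embeds)
print(masks.shape, labels.argmax(dim=-1))
```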

🔧 Training Recipe

  • Step 1: SAM2CLIP training with SA-1B (1%) (distillation loss)
    • Extract feature (F_{sam}) with SAM encoder
    • Extract feature (E_I) with CLIP visual encoder
    • An adapter (Transformer layers) aligns the CLIP features to the SAM features via distillation, preserving SAM's segmentation ability
\[L_{distill} = \mathrm{MSE}\!\left(F_{sam}, A_{sam2clip}\!\left(\mathrm{Fusion}\!\left(E_I^i\right)\right)\right)\]
  • Step 2: Joint training of CLIP2SAM + Mask Decoder on COCO/LVIS (segmentation losses; see the training sketch after this list)
    • CLIP2SAM: transforms CLIP semantic features into SAM-compatible region features
    • Pipeline:
      1. Image → CLIP encoder (frozen) → multi-scale features
      2. Prompt (point/box) → Prompt Encoder (SAM)
      3. CLIP2SAM (+FPN): multi-scale CLIP features → SAM-compatible region features
      4. Mask Decoder (SAM): predicts mask/IoU
      5. Recognition Head: Q_label vs. CLIP text embeddings → label score
  • Additional: Joint training with ImageNet → expansion to 22K classes
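
A condensed sketch of the two-stage recipe described above; the encoders, adapters, data, and loss weights are hypothetical toy stand-ins. Stage 1 mirrors the MSE distillation loss given earlier, and stage 2 pairs a mask loss with a label-similarity loss for joint training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real modules (frozen teachers, trainable adapters, decoder).
clip_encoder = nn.Conv2d(3, 512, 16, stride=16)          # frozen CLIP visual encoder (stand-in)
sam_encoder = nn.Conv2d(3, 256, 16, stride=16)           # frozen SAM encoder used as the teacher
sam2clip = nn.Conv2d(512, 256, 1)                        # adapter trained in stage 1
clip2sam = nn.Conv2d(512, 256, 1)                        # adapter trained in stage 2
mask_decoder = nn.Conv2d(256, 1, 1)                      # SAM-style mask decoder (stage 2)

image = torch.rand(2, 3, 224, 224)                       # placeholder batch (SA-1B / COCO images)

# Stage 1: distill SAM features into the CLIP branch, L = MSE(F_sam, A_sam2clip(E_I)).
with torch.no_grad():
    f_sam = sam_encoder(image)
e_i = clip_encoder(image)
loss_distill = F.mse_loss(sam2clip(e_i), f_sam)

# Stage 2: joint training of CLIP2SAM + mask decoder with a mask loss and a label loss.
gt_mask = torch.randint(0, 2, (2, 1, 14, 14)).float()    # placeholder ground-truth masks
gt_label = F.normalize(torch.randn(2, 256), dim=-1)      # placeholder target class embeddings
region_feat = clip2sam(e_i.detach())                     # CLIP features made SAM-compatible
pred_mask = mask_decoder(region_feat)
query = region_feat.mean(dim=(2, 3))                     # pooled query for the recognition head
loss_mask = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)
loss_label = (1 - F.cosine_similarity(query, gt_label, dim=-1)).mean()

(loss_distill + loss_mask + loss_label).backward()       # loss weights are a modeling choice
```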

🧪 Experimental Results

🎯 Open-Vocabulary Segmentation

Image

  • COCO (IoU_b=81.5 / IoU_n=84.0), LVIS (IoU_b=83.1 / IoU_n=83.6)
    • Balanced performance across base/novel classes, outperforming baselines
  • FLOPs (1,180G) and parameter count (304M) are much lower than the two-backbone baselines → efficiency and accuracy together
  • With * (mask center point prompt), baselines collapse in performance
    • Image-Crop baseline*: COCO IoU_n=26.4, LVIS IoU_n=2.3
    • OVSAM*: IoU_b=63.6, IoU_n=67.9 → robust even with weak prompts

🎯 Segmentation

Image

  • Mask quality nearly matches SAM-Huge while using ~half the parameters!

Image

  • With bbox prompts from an OV-detector, OVSAM achieves strong labeling performance compared to other segmentation models.

👀 Qualitative Comparisons

Image

  • Works well with both box and point prompts
  • Everything mode: auto-labels dozens of masks (e.g., "cat", "dog", "sofa")
  • Useful for interactive tools, robotics/AR, accessibility technologies

🧪 Ablation Studies

  • The Recognition Head's text-embedding precision is crucial → CLIP-based training yields the most stable results
  • Combining an IoU loss with a text-similarity joint loss improves mask–text alignment
  • Scaling from 1K → 20K classes increases runtime only linearly → real-time inference remains feasible

✅ Conclusion

  • Open-Vocabulary SAM = SAM's "Segment Anything" + CLIP's "Recognize Anything"
  • Enables 20,000+ class zero-shot recognition with full prompt compatibility
  • Ready for practical deployment: instantly outputs what + where
  • OVSAM evolves SAM into not just a mask generator but a naming vision system: the new standard for segmentation + recognition!

This post is licensed under CC BY 4.0 by the author.