
🧠 Understanding SEEM: Segment Everything Everywhere All at Once!!

🧠 SEEM: Segment Everything Everywhere All at Once

🔍 A universal segmentation model that handles text, clicks, boxes, and more via multimodal prompts.

Paper: SEEM: Segment Everything Everywhere All at Once
Conference: NeurIPS 2023 (Zou, Xueyan, et al.)
Code: UX-Decoder/Segment-Everything-Everywhere-All-At-Once
Comment: All-in-one segmentation with multi-modal prompting!


🎯 Four Core Capabilities of SEEM

  1. 🎛️ Versatility

    • Unifies various spatial queries (clicks, boxes, scribbles, masks) into a single visual prompt
    • Can even handle referred regions from other images
  2. 🔗 Compositionality

    • Learns a joint visual-language embedding space for interpreting combinations of text and visual prompts
    • Freely supports prompt composition
  3. 🔁 Interactivity

    • Uses memory prompts to retain previous segmentation information
    • Optimized for iterative interaction
  4. 🧠 Semantic-awareness

    • Aligns text and mask labels in the same semantic space
    • Enables open-vocabulary segmentation (can identify unseen classes)

📚 Background: Why Do We Need a Universal Segmentation Model?

Image segmentation is a fundamental task in computer vision, responsible for understanding objects at the pixel level. Traditional approaches such as semantic, instance, and panoptic segmentation have laid a strong foundation. But the current trend is moving toward flexible and general-purpose segmentation models.

🔄 Evolution of Segmentation

  1. Closed-set → Open-vocabulary

    • Instead of recognizing fixed classes, models now use multimodal pretraining (e.g., CLIP) to generalize to unseen categories.
  2. Generic → Referring

    • Text-guided segmentation is gaining traction as it offers a more intuitive interface for users.
  3. One-shot → Interactive

    • Users provide input iteratively (clicks, boxes, etc.) to refine results step by step.

Despite these advances, many models still rely on task-specific architectures and lack the flexibility to handle diverse inputs or task switching within one system.


🧠 Meanwhile, Language Models Have Solved This

Language models like GPT-3 and T5 paved the way for unified interfaces by handling multiple NLP tasks with a single model through prompting.

However, segmentation models still face these limitations:

  • Limited prompt types (text, box, click only)
  • Outputs masks only, without semantic meaning
  • Poor generalization to new prompt combinations or domains

🚀 Enter SEEM

SEEM addresses all these challenges with:

  • A single model that handles all types of segmentation tasks
  • Integrated support for text, visual, and memory prompts
  • Flexible prompt composition and interactive updates
  • Open-vocabulary capabilities for semantic prediction

✅ Just like GPT understands text in context, SEEM segments the world interactively and semantically.


🧠 SEEM Model Architecture

SEEM builds on the encoder-decoder paradigm and accepts textual (Pt), visual (Pv), and memory (Pm) prompts to drive segmentation.


📦 1. Overall Pipeline

(Figure: overall SEEM pipeline)

Input image (I)
↓
[Image Encoder] → Feature map Z
↓
[SEEM Decoder (Queries + Prompt Interaction)]
↓
→ MaskPredictor → Output mask M
→ ConceptClassifier → Semantic label C

🧱 2. Key Components

(1) Image Encoder

  • Input: I ∈ ℝ^{H×W×3}
  • Output: visual feature map Z
  • Uses Vision Transformer variants

(2) Prompts

  • Pt: Text prompts (natural language commands)
  • Pv: Visual prompts (clicks, boxes, scribbles, masks, referred images)
  • Pm: Memory prompts (track previous interaction results)

(3) Learnable Queries (Qh)

  • Trainable tokens to query outputs (mask and class)
  • Duplicated per task during training (generic, referring, interactive)

🔄 3. Decoder Operations

(Figure: SEEM decoder in detail)

(1) Query-prompt interaction

<Om_h, Oc_h> = Decoder(Qh ; <Pt, Pv, Pm> | Z)
  • Om_h: Embedding for segmentation masks
  • Oc_h: Embedding for semantic concepts

(2) Mask prediction

M = MaskPredictor(Om_h)

(3) Concept classification

C = ConceptClassifier(Oc_h)
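
To make these three steps concrete, here is a minimal PyTorch sketch of the query-prompt interaction, written with a plain nn.TransformerDecoder. It is only an illustration under my own assumptions (the class name SEEMLikeDecoder, the two heads, and all shapes are invented for the example), not the authors' implementation:

import torch
import torch.nn as nn

class SEEMLikeDecoder(nn.Module):
    # Hedged sketch: learnable queries Qh attend to image features Z and prompt tokens,
    # producing mask embeddings Om_h and concept embeddings Oc_h.
    def __init__(self, dim=256, num_queries=100, num_layers=3):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))                # Qh
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(dim, dim)                                       # -> Om_h
        self.concept_head = nn.Linear(dim, dim)                                    # -> Oc_h

    def forward(self, z, prompts):
        # z: (B, HW, dim) flattened image features; prompts: (B, P, dim) tokens for Pt/Pv/Pm
        qh = self.queries.unsqueeze(0).expand(z.size(0), -1, -1)
        memory = torch.cat([z, prompts], dim=1)        # queries see features and prompts together
        out = self.decoder(tgt=qh, memory=memory)
        om_h = self.mask_head(out)                     # embeddings for masks
        oc_h = self.concept_head(out)                  # embeddings for semantic concepts
        masks = torch.einsum("bqc,bpc->bqp", om_h, z)  # MaskPredictor: dot product with pixel features
        return masks, oc_h

dec = SEEMLikeDecoder()
z = torch.randn(1, 64 * 64, 256)   # feature map Z flattened to HW tokens
prompts = torch.randn(1, 5, 256)   # e.g., one text token plus a few sampled click tokens
masks, oc_h = dec(z, prompts)
print(masks.shape, oc_h.shape)     # torch.Size([1, 100, 4096]) torch.Size([1, 100, 256])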

🌐 4. Key Capabilities

🧩 Versatile: Unify Diverse Inputs
  • SEEM handles all non-text prompts as visual prompts (Pv) (clicks, boxes, scribbles, reference image)
  • These inputs are projected into a shared visual space via a Visual Sampler
  • Enables seamless composition with text inputs
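
As a rough idea of what such a Visual Sampler could look like, the sketch below turns any spatial prompt, once rasterized to a binary map, into prompt tokens by sampling image features inside the indicated region (the function name and the sampling details are my own simplification, not the paper's exact procedure):

import torch

def visual_sampler(z, prompt_map, num_points=512):
    # z: (C, H, W) image features; prompt_map: (H, W) binary map of the prompted region.
    # A click, box, scribble, or mask can all be rasterized into such a map first.
    c = z.size(0)
    ys, xs = torch.nonzero(prompt_map, as_tuple=True)  # pixel locations covered by the prompt
    if ys.numel() == 0:
        return torch.empty(0, c)
    keep = torch.randperm(ys.numel())[:num_points]     # sample at most num_points locations
    return z[:, ys[keep], xs[keep]].T                  # (num_sampled, C) visual prompt tokens Pv

z = torch.randn(256, 64, 64)
click_map = torch.zeros(64, 64)
click_map[30:32, 40:42] = 1                            # a click rasterized to a tiny region
print(visual_sampler(z, click_map).shape)              # torch.Size([4, 256])
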
🧠 Compositional: Flexible Prompt Combinations
  • Supports mixing visual and text prompts during inference
  • Visual prompts align with Om_h, text prompts align with Oc_h
  • Uses IOUmask (IoU between predicted and GT masks) for more accurate matching
  • Trained to generalize to unseen prompt combinations
🔄 Interactive: Iterative Refinement
  • Introduces Pm (memory prompts) to carry forward context
  • MaskedCrossAtt(Pm; Mp | Z) updates memory based on previous mask
  • Efficiently supports multi-round segmentation without re-encoding the image
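
A minimal sketch of the MaskedCrossAtt idea, assuming the memory prompt simply attends to image features restricted to the previous mask Mp (the real decoder layer is more involved than this):

import torch
import torch.nn.functional as F

def masked_cross_attention(pm, z, prev_mask):
    # pm: (Q, C) memory prompt tokens; z: (HW, C) flattened image features;
    # prev_mask: (HW,) boolean map of the previous prediction Mp.
    scores = pm @ z.T / z.size(-1) ** 0.5                   # (Q, HW) attention logits
    scores = scores.masked_fill(~prev_mask, float("-inf"))  # only look inside the previous mask
    attn = F.softmax(scores, dim=-1)
    return pm + attn @ z                                    # updated memory prompt for the next round

pm = torch.randn(1, 256)                                    # a single memory token Pm
z = torch.randn(64 * 64, 256)
prev_mask = torch.zeros(64 * 64, dtype=torch.bool)
prev_mask[:500] = True                                      # pixels segmented in the last round
print(masked_cross_attention(pm, z, prev_mask).shape)       # torch.Size([1, 256])
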
🧠 Semantic-aware: Meaningful Predictions
  • Unlike class-agnostic models, SEEM predicts semantic labels for masks
  • Thanks to alignment in visual-language embedding space (zero-shot capable)
  • No semantic label training required for interactive tasks

🧪 Experiments Summary

SEEM demonstrates strong performance across four segmentation settings with a single unified model.


📂 Datasets and Setup

  • Trained on:

    • COCO2017 (Panoptic & Interactive)
    • RefCOCO, RefCOCO+, RefCOCOg (Referring)
  • Backbone: FocalT, DaViT-d3/d5

  • Language Encoder: UniCL / Florence

  • Metrics: PQ, AP, mIoU, NoC@85/90, 1-IoU, K-NoC@90, Zero-shot VOS
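
For reference, NoC@85/90 counts how many clicks are needed before the predicted mask reaches 85%/90% IoU. A minimal sketch of the metric (the cap of 20 clicks is an assumption commonly used in the interactive-segmentation literature, not something stated above):

def noc_at(ious_per_click, threshold=0.85, max_clicks=20):
    # ious_per_click[i] is the IoU of the prediction after click i+1
    for i, iou in enumerate(ious_per_click, start=1):
        if iou >= threshold:
            return i
    return max_clicks  # threshold never reached within the click budget

print(noc_at([0.40, 0.72, 0.86]))  # -> 3 clicks to pass 85% IoU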


🔍 Main Results

  • Generic: +10 points on Panoptic PQ vs. Pix2Seqv2, SegGPT, etc.
  • Referring: With visual prompt: +10.5 cIoU, +6.0 mIoU, +9.3 AP50 (Tiny model)
  • Interactive: Better than SAM using 100x less data; supports diverse prompts
  • Video: Zero-shot VOS (DAVIS17) + 1-click interactive DAVIS16

📊 Summary Table

Task Type | SEEM Highlights
Generic Segmentation | +10 PQ over baselines
Referring Segmentation | +10.5 cIoU / +6.0 mIoU / +9.3 AP50 with visual prompt
Interactive Segmentation | Outperforms SAM; supports text, box, click, scribble, polygon inputs
Video Segmentation | Zero-shot DAVIS17, strong 1-click interactive performance on DAVIS16

📝 Conclusion

Ultimately, SEEM behaves like a visual ChatGPT: segmenting from multimodal prompts and refining results interactively.

Segmentation with meaning: not just masks, but concepts too!

(Figure: segmentation results with semantic labels)

Like SAM2, SEEM tracks objects across video frames without retraining!

(Figure: video object segmentation results across frames)


Stay tuned for future experiments and hands-on tutorials!


🧠 (Korean) SEEM: Segment Everything Everywhere All at Once

🔍 Text, clicks, boxes, anything goes! A universal segmentation model that segments the world with multimodal prompts.

Paper: SEEM: Segment Everything Everywhere All at Once
Conference: NeurIPS 2023 (Zou, Xueyan, et al.)
Code: UX-Decoder/Segment-Everything-Everywhere-All-At-Once
Comment: Segment everything at once with multi-modal prompts!!


🎯 The Four Core Capabilities of SEEM!!!

  1. 🎛️ Versatility
    • Unifies diverse queries such as clicks, boxes, scribbles, and masks into a single visual prompt
    • Extensible enough to make use of reference images as well
  2. 🔗 Compositionality
    • Learns a joint visual-language space in which text and image prompts can be interpreted together
    • Prompts can be combined freely
  3. 🔁 Interactivity
    • Remembers previous segmentation information through memory prompts
    • Optimized for iterative interaction with the user
  4. 🧠 Semantic-awareness
    • Encodes text and mask labels into the same semantic space
    • Enables open-vocabulary segmentation (recognizes new classes too)

📚 Background to SEEM: Why Do We Need a Universal Segmentation Model?

Image segmentation is a core task in computer vision: identifying and structuring objects at the pixel level.
It has long been studied through approaches such as semantic, instance, and panoptic segmentation.
Recently, however, the trend in vision AI is moving beyond raw accuracy toward more flexible, general-purpose segmentation models.

🔄 How Segmentation Is Evolving

Recent segmentation research is expanding rapidly in three directions!!

  1. Closed-set → Open-vocabulary
    • Earlier models could only recognize predefined classes, but there are now active attempts to recognize new concepts by leveraging multimodal pretrained models such as CLIP.
  2. Generic → Referring
    • User-friendly interfaces that point to and segment a specific region with a text phrase are drawing attention, so models that faithfully follow language instructions have become necessary.
  3. One-shot → Interactive
    • Interaction-based models that let users repeatedly provide clicks, boxes, and other inputs to progressively refine the result are becoming important.

These advances have made segmentation models more practical, but each task still has its own separate architecture, and the models cannot flexibly handle diverse input types or switch between tasks.


🧠 But, Language Models Have Already Solved This?

In text processing, large language models (LLMs) such as GPT-3 and T5 have opened the era of universal language models that can handle many different language tasks through a single interface.

  • Simply changing the prompt lets one model perform question answering, translation, summarization, dialogue, and more; this has become the standard structure.

Visual segmentation, however, still has the following limitations:

  • Prompt types such as clicks, boxes, and text are limited
  • Models produce only masks without semantic meaning (e.g., SAM)
  • Poor generalization to new task combinations or domains

🚀 So SEEM Was Born

This is the backdrop against which SEEM appeared.

SEEM tackles the existing problems head-on, aiming for:

  • A single model that handles every kind of segmentation task
  • Unified handling of all prompts: clicks, boxes, masks, text, and reference images
  • Free composition of prompts, plus interactivity that even remembers previous history
  • Open-vocabulary support that also provides meaningful labels

Just as an LLM handles text, SEEM is a strong attempt to realize a truly universal interface for visual segmentation.

✅ SEEM is a truly universal segmentation framework that "follows text instructions, refines with clicks,
and even remembers the previous history."

🧠 SEEM Model Architecture

SEEM is designed as a universal segmentation model that builds on the traditional encoder-decoder structure
while accepting text, visual, and memory prompts all at once.


📦 1. Overall Pipeline

(Figure: overall SEEM pipeline)

Input image (I)
↓
[Image Encoder] → extract image features Z
↓
[SEEM Decoder (queries + prompt interaction)]
↓
→ MaskPredictor → output mask M
→ ConceptClassifier → output semantic label C

In the end, the model takes the input image (I) and prompts (P) in various forms,
and the decoder outputs the segmentation mask (M) together with the meaning (C) of that mask!!


🧱 2. Components

(1) Image Encoder

  • Input image I ∈ ℝ^{H×W×3}
  • Extracts the visual feature map Z
  • Vision Transformer-family backbones can be used

(2) Prompts (prompt types)

  • Pt : text prompt (natural-language command)
  • Pv : visual prompt (point, box, scribble, mask, referred region)
  • Pm : memory prompt (stores past segmentation information)

(3) Learnable Queries (Qh)

  • Learnable queries used to produce the mask and class outputs
  • During training, Qh is duplicated for generic / referring / interactive segmentation

🔄 3. How the Decoder Works

(Figure: SEEM decoder in detail)

(1) Query-prompt interaction

⟨Om_h, Oc_h⟩ = Decoder(Qh ; ⟨Pt, Pv, Pm⟩ | Z)
  • Om_h : embedding for the mask
  • Oc_h : embedding for the class description

(2) Mask prediction

M = MaskPredictor(Om_h)

(3) Semantic class prediction

C = ConceptClassifier(Oc_h)

🌐 4. Key Properties


🧩 Versatile: unifying diverse inputs into one
  • SEEM treats every non-text input, such as clicks, boxes, scribbles, and reference images, as a single visual prompt (Pv).
  • Unlike prior approaches, it does not use a separate module per input type; instead, a Visual Sampler aligns all non-text inputs into the same representation space.
  • Thanks to this, text + visual prompts can be combined naturally, and the user's intent is reflected more accurately.

🧠 Compositional: flexible handling of prompt combinations
  • In practice, users often provide text and visual prompts together.
  • Even when different kinds of prompts are given at once, SEEM aligns each one to a different output target, overcoming the gap between their representation spaces.
  • Concretely, visual prompts (Pv) are aligned with the mask embedding (Om_h), and text prompts (Pt) with the class embedding (Oc_h).
  • The IOUmask used as the matching criterion relies on the overlap (IoU: Intersection over Union) between predicted and ground-truth masks, which helps decide which prompt fits which output.
  • After training, a single model can handle no prompt, a single prompt, or both together,
    and it generalizes to combinations never seen during training.

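As a rough illustration of the quantity behind this matching criterion (a simplified sketch; the full matching cost in the paper involves more than plain IoU), pairwise mask IoU between candidate and target masks can be computed as follows:

import torch

def mask_iou(pred, gt):
    # pred: (N, H, W) candidate binary masks; gt: (M, H, W) target masks -> (N, M) IoU matrix
    pred = pred.flatten(1).float()
    gt = gt.flatten(1).float()
    inter = pred @ gt.T                                    # overlapping pixel counts
    union = pred.sum(1, keepdim=True) + gt.sum(1) - inter
    return inter / union.clamp(min=1)

pred = torch.rand(100, 64, 64) > 0.5                       # 100 candidate masks
gt = torch.rand(3, 64, 64) > 0.5                           # 3 target masks
iou = mask_iou(pred, gt)                                   # (100, 3)
print(iou.shape, iou.argmax(dim=0))                        # best-matching candidate per target
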
🔄 Interactive: progressive segmentation through repeated interaction
  • SEEM introduces a memory prompt called Pm, which carries the previous mask result into the current input.
  • Information from the previous mask is injected through masked cross-attention, which restricts attention to the masked region, so the segmentation result can be progressively refined with each repeated input.
  • It performs this memory function without any separate additional network, which also makes it efficient.

🧠 Semantic-aware: segmentation results that carry meaning
  • Existing interactive segmentation models simply generate masks,
    but SEEM can also predict what each mask is (its semantic class).
  • Because visual prompts and text embeddings are aligned in a joint visual-semantic space, meanings can be classified zero-shot even though no semantic labels were used during training for the interactive task.
  • Thanks to this, SEEM goes beyond plain segmentation and can also explain "what was segmented."

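A minimal sketch of this zero-shot labeling idea, assuming class-name embeddings from the shared text encoder are available (names and shapes are illustrative, not the official code):

import torch
import torch.nn.functional as F

def open_vocab_labels(oc_h, text_embeds, class_names):
    # oc_h: (Q, C) concept embeddings of predicted masks;
    # text_embeds: (K, C) text-encoder embeddings of candidate class names.
    sims = F.normalize(oc_h, dim=-1) @ F.normalize(text_embeds, dim=-1).T  # cosine similarity (Q, K)
    return [class_names[int(i)] for i in sims.argmax(dim=-1)]

oc_h = torch.randn(5, 256)                       # 5 predicted masks
text_embeds = torch.randn(3, 256)                # embeddings of 3 arbitrary class names
print(open_vocab_labels(oc_h, text_embeds, ["dog", "cat", "zebra"]))
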
🧪 Experiments Summary

SEEM handles diverse segmentation tasks with a single unified model,
and it demonstrated strong performance in the following four main experiments.


📂 Datasets and Setup

  • Training tasks:
    • Panoptic Segmentation (COCO2017)
    • Referring Segmentation (RefCOCO, RefCOCO+, RefCOCOg)
    • Interactive Segmentation (simulated clicks based on COCO2017)
  • Model configuration:
    • Vision backbone: FocalT, DaViT-d3/d5
    • Language encoder: UniCL or Florence
    • The decoder is replaced with the SEEM-Decoder
  • Evaluation metrics:
    • PQ (Panoptic Quality), AP (Average Precision), mIoU
    • NoC@85 / NoC@90, 1-IoU, K-NoC@90
    • Video: zero-shot evaluation (DAVIS17, DAVIS16-Interactive)

🔍 Main Experimental Results

  • Generic Segmentation
    • Panoptic PQ improved by +10 points over prior generalist models (Pix2Seqv2, Painter, etc.)
  • Referring Segmentation
    • Adding a visual prompt improves cIoU by +10.5, mIoU by +6.0, and AP50 by +9.3
  • Interactive Segmentation
    • Better performance than SAM while using less data
    • Can combine diverse prompts (text, clicks, boxes, etc.)
  • Video Object Segmentation
    • Runs zero-shot without any architectural change
    • Performance at a fully-supervised level on DAVIS17
    • Strong performance with a single click on DAVIS16-Interactive

📊 Overall Performance Summary

Task Type | SEEM Highlights
Generic Segmentation | Panoptic performance +10 points over prior models
Referring Segmentation | cIoU +10.5, mIoU +6.0, AP50 +9.3 when a visual prompt is added
Interactive Segmentation | Beats SAM with less data, supports diverse inputs (text, box, etc.)
Video Segmentation | Competitive zero-shot performance on DAVIS17/16, no architectural modification needed

📝 Conclusion

So in the end!! SEEM can use text and multiple image inputs as prompts,
and just like ChatGPT keeps a conversation going, SEEM can continue segmenting on top of the previous prompts!!
The paper shows a variety of result images for this, and they are really interesting!!

On top of segmentation, it even gives the class meaning. Impressive!!
(Figure: segmentation results with semantic labels)

Below, it is also striking that, like SAM2, it segments while tracking the frames of a video!
(Figure: zero-shot video object segmentation results)

  • We will soon explore this through hands-on practice as well!!

This post is licensed under CC BY 4.0 by the author.