
🧠 Understanding SEEM: Segment Everything Everywhere All at Once!!

🧠 SEEM: Segment Everything Everywhere All at Once

🔍 A universal segmentation model that handles text, clicks, boxes, and more via multimodal prompts.

Paper: SEEM: Segment Everything Everywhere All at Once
Conference: NeurIPS 2023 (Zou, Xueyan, et al.)
Code: UX-Decoder/Segment-Everything-Everywhere-All-At-Once
Comment: All-in-one segmentation with multi-modal prompting!


🎯 Four Core Capabilities of SEEM

  1. 🎛️ Versatility

    • Unifies various spatial queries (clicks, boxes, scribbles, masks) into a single visual prompt
    • Can even handle referred regions from other images
  2. 🔗 Compositionality

    • Learns a joint visual-language embedding space for interpreting combinations of text and visual prompts
    • Freely supports prompt composition
  3. 🔁 Interactivity

    • Uses memory prompts to retain previous segmentation information
    • Optimized for iterative interaction
  4. 🧠 Semantic-awareness

    • Aligns text and mask labels in the same semantic space
    • Enables open-vocabulary segmentation (can identify unseen classes)

📚 Background: Why Do We Need a Universal Segmentation Model?

Image segmentation is a fundamental task in computer vision, responsible for understanding objects at the pixel level. Traditional approaches such as semantic, instance, and panoptic segmentation have laid a strong foundation. But the current trend is moving toward flexible and general-purpose segmentation models.

🔄 Evolution of Segmentation

  1. Closed-set → Open-vocabulary

    • Instead of recognizing fixed classes, models now use multimodal pretraining (e.g., CLIP) to generalize to unseen categories.
  2. Generic → Referring

    • Text-guided segmentation is gaining traction as it offers a more intuitive interface for users.
  3. One-shot → Interactive

    • Users provide input iteratively (clicks, boxes, etc.) to refine results step by step.

Despite these advances, many models still rely on task-specific architectures and lack the flexibility to handle diverse inputs or task switching within one system.


🧠 Meanwhile, Language Models Have Solved This

Language models like GPT-3 and T5 paved the way for unified interfaces by handling multiple NLP tasks with a single model through prompting.

However, segmentation models still face these limitations:

  • Limited prompt types (text, box, click only)
  • Outputs masks only, without semantic meaning
  • Poor generalization to new prompt combinations or domains

🚀 Enter SEEM

SEEM addresses all these challenges with:

  • A single model that handles all types of segmentation tasks
  • Integrated support for text, visual, and memory prompts
  • Flexible prompt composition and interactive updates
  • Open-vocabulary capabilities for semantic prediction

✅ Just like GPT understands text in context, SEEM segments the world interactively and semantically.


🧠 SEEM Model Architecture

SEEM builds on the encoder-decoder paradigm and accepts textual (Pt), visual (Pv), and memory (Pm) prompts to drive segmentation.


📦 1. Overall Pipeline

(Figure: overall SEEM pipeline)

Input image (I)
↓
[Image Encoder] → Feature map Z
↓
[SEEM Decoder (Queries + Prompt Interaction)]
↓
→ MaskPredictor → Output mask M
→ ConceptClassifier → Semantic label C

🧱 2. Key Components

(1) Image Encoder

  • Input: I ∈ ℝ^{H×W×3}
  • Output: visual feature map Z
  • Uses Vision Transformer variants

(2) Prompts

  • Pt: Text prompts (natural language commands)
  • Pv: Visual prompts (clicks, boxes, scribbles, masks, referred images)
  • Pm: Memory prompts (track previous interaction results)

(3) Learnable Queries (Qh)

  • Trainable tokens to query outputs (mask and class)
  • Duplicated per task during training (generic, referring, interactive)

🔄 3. Decoder Operations

(Figure: SEEM decoder in detail)

(1) Query-prompt interaction

<Om_h, Oc_h> = Decoder(Qh ; <Pt, Pv, Pm> | Z)
  • Om_h: Embedding for segmentation masks
  • Oc_h: Embedding for semantic concepts

(2) Mask prediction

M = MaskPredictor(Om_h)

(3) Concept classification

C = ConceptClassifier(Oc_h)
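
To make these three steps concrete, here is a minimal PyTorch sketch of the query-prompt interaction, written with a plain nn.TransformerDecoder. It is only an illustration under my own assumptions (the class name SEEMLikeDecoder, the two heads, and all shapes are invented for the example), not the authors' implementation:

import torch
import torch.nn as nn

class SEEMLikeDecoder(nn.Module):
    # Hedged sketch: learnable queries Qh attend to image features Z and prompt tokens,
    # producing mask embeddings Om_h and concept embeddings Oc_h.
    def __init__(self, dim=256, num_queries=100, num_layers=3):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))                # Qh
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(dim, dim)                                       # -> Om_h
        self.concept_head = nn.Linear(dim, dim)                                    # -> Oc_h

    def forward(self, z, prompts):
        # z: (B, HW, dim) flattened image features; prompts: (B, P, dim) tokens for Pt/Pv/Pm
        qh = self.queries.unsqueeze(0).expand(z.size(0), -1, -1)
        memory = torch.cat([z, prompts], dim=1)        # queries see features and prompts together
        out = self.decoder(tgt=qh, memory=memory)
        om_h = self.mask_head(out)                     # embeddings for masks
        oc_h = self.concept_head(out)                  # embeddings for semantic concepts
        masks = torch.einsum("bqc,bpc->bqp", om_h, z)  # MaskPredictor: dot product with pixel features
        return masks, oc_h

dec = SEEMLikeDecoder()
z = torch.randn(1, 64 * 64, 256)   # feature map Z flattened to HW tokens
prompts = torch.randn(1, 5, 256)   # e.g., one text token plus a few sampled click tokens
masks, oc_h = dec(z, prompts)
print(masks.shape, oc_h.shape)     # torch.Size([1, 100, 4096]) torch.Size([1, 100, 256])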

🌐 4. Key Capabilities

🧩 Versatile: Unify Diverse Inputs
  • SEEM handles all non-text prompts as visual prompts (Pv) (clicks, boxes, scribbles, reference image)
  • These inputs are projected into a shared visual space via a Visual Sampler
  • Enables seamless composition with text inputs
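
As a rough idea of what such a Visual Sampler could look like, the sketch below turns any spatial prompt, once rasterized to a binary map, into prompt tokens by sampling image features inside the indicated region (the function name and the sampling details are my own simplification, not the paper's exact procedure):

import torch

def visual_sampler(z, prompt_map, num_points=512):
    # z: (C, H, W) image features; prompt_map: (H, W) binary map of the prompted region.
    # A click, box, scribble, or mask can all be rasterized into such a map first.
    c = z.size(0)
    ys, xs = torch.nonzero(prompt_map, as_tuple=True)  # pixel locations covered by the prompt
    if ys.numel() == 0:
        return torch.empty(0, c)
    keep = torch.randperm(ys.numel())[:num_points]     # sample at most num_points locations
    return z[:, ys[keep], xs[keep]].T                  # (num_sampled, C) visual prompt tokens Pv

z = torch.randn(256, 64, 64)
click_map = torch.zeros(64, 64)
click_map[30:32, 40:42] = 1                            # a click rasterized to a tiny region
print(visual_sampler(z, click_map).shape)              # torch.Size([4, 256])
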
🧠 Compositional: Flexible Prompt Combinations
  • Supports mixing visual and text prompts during inference
  • Visual prompts align with Om_h, text prompts align with Oc_h
  • Uses IOUmask (IoU between predicted and GT masks) for more accurate matching
  • Trained to generalize to unseen prompt combinations
🔄 Interactive: Iterative Refinement
  • Introduces Pm (memory prompts) to carry forward context
  • MaskedCrossAtt(Pm; Mp | Z) updates memory based on previous mask
  • Efficiently supports multi-round segmentation without re-encoding the image
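
A minimal sketch of the MaskedCrossAtt idea, assuming the memory prompt simply attends to image features restricted to the previous mask Mp (the real decoder layer is more involved than this):

import torch
import torch.nn.functional as F

def masked_cross_attention(pm, z, prev_mask):
    # pm: (Q, C) memory prompt tokens; z: (HW, C) flattened image features;
    # prev_mask: (HW,) boolean map of the previous prediction Mp.
    scores = pm @ z.T / z.size(-1) ** 0.5                   # (Q, HW) attention logits
    scores = scores.masked_fill(~prev_mask, float("-inf"))  # only look inside the previous mask
    attn = F.softmax(scores, dim=-1)
    return pm + attn @ z                                    # updated memory prompt for the next round

pm = torch.randn(1, 256)                                    # a single memory token Pm
z = torch.randn(64 * 64, 256)
prev_mask = torch.zeros(64 * 64, dtype=torch.bool)
prev_mask[:500] = True                                      # pixels segmented in the last round
print(masked_cross_attention(pm, z, prev_mask).shape)       # torch.Size([1, 256])
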
🧠 Semantic-aware: Meaningful Predictions
  • Unlike class-agnostic models, SEEM predicts semantic labels for masks
  • Thanks to alignment in visual-language embedding space (zero-shot capable)
  • No semantic label training required for interactive tasks

🧪 Experiments Summary

SEEM demonstrates strong performance across four segmentation settings with a single unified model.


📂 Datasets and Setup

  • Trained on:

    • COCO2017 (Panoptic & Interactive)
    • RefCOCO, RefCOCO+, RefCOCOg (Referring)
  • Backbone: FocalT, DaViT-d3/d5

  • Language Encoder: UniCL / Florence

  • Metrics: PQ, AP, mIoU, NoC@85/90, 1-IoU, K-NoC@90, Zero-shot VOS
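
For reference, NoC@85/90 counts how many clicks are needed before the predicted mask reaches 85%/90% IoU. A minimal sketch of the metric (the cap of 20 clicks is an assumption commonly used in the interactive-segmentation literature, not something stated above):

def noc_at(ious_per_click, threshold=0.85, max_clicks=20):
    # ious_per_click[i] is the IoU of the prediction after click i+1
    for i, iou in enumerate(ious_per_click, start=1):
        if iou >= threshold:
            return i
    return max_clicks  # threshold never reached within the click budget

print(noc_at([0.40, 0.72, 0.86]))  # -> 3 clicks to pass 85% IoU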


🔍 Main Results

  • Generic: +10 points on Panoptic PQ vs. Pix2Seqv2, SegGPT, etc.
  • Referring: With visual prompt: +10.5 cIoU, +6.0 mIoU, +9.3 AP50 (Tiny model)
  • Interactive: Better than SAM using 100x less data; supports diverse prompts
  • Video: Zero-shot VOS (DAVIS17) + 1-click interactive DAVIS16

📊 Summary Table

Task Type | SEEM Highlights
Generic Segmentation | +10 PQ over baselines
Referring Segmentation | +10.5 cIoU / +6.0 mIoU / +9.3 AP50 with visual prompt
Interactive Segmentation | Outperforms SAM; supports text, box, click, scribble, polygon inputs
Video Segmentation | Zero-shot DAVIS17, strong 1-click interactive performance on DAVIS16

📝 Conclusion

Ultimately, SEEM behaves like a visual ChatGPT: segmenting from multimodal prompts and refining results interactively.

Segmentation with meaning: not just masks, but concepts too!

(Figure: segmentation results with semantic labels)

Like SAM2, SEEM tracks objects across video frames without retraining!

(Figure: video object segmentation results across frames)


Stay tuned for future experiments and hands-on tutorials!


🧠 (Korean) SEEM: Segment Everything Everywhere All at Once

🔍 Text, clicks, boxes, anything goes! A universal segmentation model that segments the world with multimodal prompts.

Paper: SEEM: Segment Everything Everywhere All at Once
Conference: NeurIPS 2023 (Zou, Xueyan, et al.)
Code: UX-Decoder/Segment-Everything-Everywhere-All-At-Once
Comment: Segment everything at once with multi-modal prompts!!


🎯 The Four Core Capabilities of SEEM!!!

  1. 🎛️ Versatility
    • Unifies diverse queries such as clicks, boxes, scribbles, and masks into a single visual prompt
    • Extensible enough to make use of reference images as well
  2. 🔗 Compositionality
    • Learns a joint visual-language space in which text and image prompts can be interpreted together
    • Prompts can be combined freely
  3. 🔁 Interactivity
    • Remembers previous segmentation information through memory prompts
    • Optimized for iterative interaction with the user
  4. 🧠 Semantic-awareness
    • Encodes text and mask labels into the same semantic space
    • Enables open-vocabulary segmentation (recognizes new classes too)

📚 Background to SEEM: Why Do We Need a Universal Segmentation Model?

Image segmentation is a core task in computer vision: identifying and structuring objects at the pixel level.
It has long been studied through approaches such as semantic, instance, and panoptic segmentation.
Recently, however, the trend in vision AI is moving beyond raw accuracy toward more flexible, general-purpose segmentation models.

🔄 How Segmentation Is Evolving

Recent segmentation research is expanding rapidly in three directions!!

  1. Closed-set → Open-vocabulary
    • Earlier models could only recognize predefined classes, but there are now active attempts to recognize new concepts by leveraging multimodal pretrained models such as CLIP.
  2. Generic → Referring
    • User-friendly interfaces that point to and segment a specific region with a text phrase are drawing attention, so models that faithfully follow language instructions have become necessary.
  3. One-shot → Interactive
    • Interaction-based models that let users repeatedly provide clicks, boxes, and other inputs to progressively refine the result are becoming important.

These advances have made segmentation models more practical, but each task still has its own separate architecture, and the models cannot flexibly handle diverse input types or switch between tasks.


🧠 But, Language Models Have Already Solved This?

In text processing, large language models (LLMs) such as GPT-3 and T5 have opened the era of universal language models that can handle many different language tasks through a single interface.

  • Simply changing the prompt lets one model perform question answering, translation, summarization, dialogue, and more; this has become the standard structure.

Visual segmentation, however, still has the following limitations:

  • Prompt types such as clicks, boxes, and text are limited
  • Models produce only masks without semantic meaning (e.g., SAM)
  • Poor generalization to new task combinations or domains

🚀 So SEEM Was Born

This is the backdrop against which SEEM appeared.

SEEM tackles the existing problems head-on, aiming for:

  • A single model that handles every kind of segmentation task
  • Unified handling of all prompts: clicks, boxes, masks, text, and reference images
  • Free composition of prompts, plus interactivity that even remembers previous history
  • Open-vocabulary support that also provides meaningful labels

Just as an LLM handles text, SEEM is a strong attempt to realize a truly universal interface for visual segmentation.

✅ SEEM is a truly universal segmentation framework that "follows text instructions, refines with clicks,
and even remembers the previous history."

🧠 SEEM Model Architecture

SEEM is designed as a universal segmentation model that builds on the traditional encoder-decoder structure
while accepting text, visual, and memory prompts all at once.


📦 1. Overall Pipeline

(Figure: overall SEEM pipeline)

Input image (I)
↓
[Image Encoder] → extract image features Z
↓
[SEEM Decoder (queries + prompt interaction)]
↓
→ MaskPredictor → output mask M
→ ConceptClassifier → output semantic label C

In the end, the model takes the input image (I) and prompts (P) in various forms,
and the decoder outputs the segmentation mask (M) together with the meaning (C) of that mask!!


🧱 2. Components

(1) Image Encoder

  • Input image I ∈ ℝ^{H×W×3}
  • Extracts the visual feature map Z
  • Vision Transformer-family backbones can be used

(2) Prompts (prompt types)

  • Pt : text prompt (natural-language command)
  • Pv : visual prompt (point, box, scribble, mask, referred region)
  • Pm : memory prompt (stores past segmentation information)

(3) Learnable Queries (Qh)

  • Learnable queries used to produce the mask and class outputs
  • During training, Qh is duplicated for generic / referring / interactive segmentation

🔄 3. How the Decoder Works

(Figure: SEEM decoder in detail)

(1) Query-prompt interaction

⟨Om_h, Oc_h⟩ = Decoder(Qh ; ⟨Pt, Pv, Pm⟩ | Z)
  • Om_h : embedding for the mask
  • Oc_h : embedding for the class description

(2) Mask prediction

M = MaskPredictor(Om_h)

(3) Semantic class prediction

C = ConceptClassifier(Oc_h)

🌐 4. Key Properties


🧩 Versatile: unifying diverse inputs into one
  • SEEM treats every non-text input, such as clicks, boxes, scribbles, and reference images, as a single visual prompt (Pv).
  • Unlike prior approaches, it does not use a separate module per input type; instead, a Visual Sampler aligns all non-text inputs into the same representation space.
  • Thanks to this, text + visual prompts can be combined naturally, and the user's intent is reflected more accurately.

🧠 Compositional: flexible handling of prompt combinations
  • In practice, users often provide text and visual prompts together.
  • Even when different kinds of prompts are given at once, SEEM aligns each one to a different output target, overcoming the gap between their representation spaces.
  • Concretely, visual prompts (Pv) are aligned with the mask embedding (Om_h), and text prompts (Pt) with the class embedding (Oc_h).
  • The IOUmask used as the matching criterion relies on the overlap (IoU: Intersection over Union) between predicted and ground-truth masks, which helps decide which prompt fits which output.
  • After training, a single model can handle no prompt, a single prompt, or both together,
    and it generalizes to combinations never seen during training.

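As a rough illustration of the quantity behind this matching criterion (a simplified sketch; the full matching cost in the paper involves more than plain IoU), pairwise mask IoU between candidate and target masks can be computed as follows:

import torch

def mask_iou(pred, gt):
    # pred: (N, H, W) candidate binary masks; gt: (M, H, W) target masks -> (N, M) IoU matrix
    pred = pred.flatten(1).float()
    gt = gt.flatten(1).float()
    inter = pred @ gt.T                                    # overlapping pixel counts
    union = pred.sum(1, keepdim=True) + gt.sum(1) - inter
    return inter / union.clamp(min=1)

pred = torch.rand(100, 64, 64) > 0.5                       # 100 candidate masks
gt = torch.rand(3, 64, 64) > 0.5                           # 3 target masks
iou = mask_iou(pred, gt)                                   # (100, 3)
print(iou.shape, iou.argmax(dim=0))                        # best-matching candidate per target
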
🔄 Interactive: progressive segmentation through repeated interaction
  • SEEM introduces a memory prompt called Pm, which carries the previous mask result into the current input.
  • Information from the previous mask is injected through masked cross-attention, which restricts attention to the masked region, so the segmentation result can be progressively refined with each repeated input.
  • It performs this memory function without any separate additional network, which also makes it efficient.

🧠 Semantic-aware: segmentation results that carry meaning
  • Existing interactive segmentation models simply generate masks,
    but SEEM can also predict what each mask is (its semantic class).
  • Because visual prompts and text embeddings are aligned in a joint visual-semantic space, meanings can be classified zero-shot even though no semantic labels were used during training for the interactive task.
  • Thanks to this, SEEM goes beyond plain segmentation and can also explain "what was segmented."

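A minimal sketch of this zero-shot labeling idea, assuming class-name embeddings from the shared text encoder are available (names and shapes are illustrative, not the official code):

import torch
import torch.nn.functional as F

def open_vocab_labels(oc_h, text_embeds, class_names):
    # oc_h: (Q, C) concept embeddings of predicted masks;
    # text_embeds: (K, C) text-encoder embeddings of candidate class names.
    sims = F.normalize(oc_h, dim=-1) @ F.normalize(text_embeds, dim=-1).T  # cosine similarity (Q, K)
    return [class_names[int(i)] for i in sims.argmax(dim=-1)]

oc_h = torch.randn(5, 256)                       # 5 predicted masks
text_embeds = torch.randn(3, 256)                # embeddings of 3 arbitrary class names
print(open_vocab_labels(oc_h, text_embeds, ["dog", "cat", "zebra"]))
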
🧪 Experiments Summary

SEEM handles diverse segmentation tasks with a single unified model,
and it demonstrated strong performance in the following four main experiments.


📂 Datasets and Setup

  • Training tasks:
    • Panoptic Segmentation (COCO2017)
    • Referring Segmentation (RefCOCO, RefCOCO+, RefCOCOg)
    • Interactive Segmentation (simulated clicks based on COCO2017)
  • Model configuration:
    • Vision backbone: FocalT, DaViT-d3/d5
    • Language encoder: UniCL or Florence
    • The decoder is replaced with the SEEM-Decoder
  • Evaluation metrics:
    • PQ (Panoptic Quality), AP (Average Precision), mIoU
    • NoC@85 / NoC@90, 1-IoU, K-NoC@90
    • Video: zero-shot evaluation (DAVIS17, DAVIS16-Interactive)

🔍 Main Experimental Results

  • Generic Segmentation
    • Panoptic PQ improved by +10 points over prior generalist models (Pix2Seqv2, Painter, etc.)
  • Referring Segmentation
    • Adding a visual prompt improves cIoU by +10.5, mIoU by +6.0, and AP50 by +9.3
  • Interactive Segmentation
    • Better performance than SAM while using less data
    • Can combine diverse prompts (text, clicks, boxes, etc.)
  • Video Object Segmentation
    • Runs zero-shot without any architectural change
    • Performance at a fully-supervised level on DAVIS17
    • Strong performance with a single click on DAVIS16-Interactive

📊 Overall Performance Summary

Task Type | SEEM Highlights
Generic Segmentation | Panoptic performance +10 points over prior models
Referring Segmentation | cIoU +10.5, mIoU +6.0, AP50 +9.3 when a visual prompt is added
Interactive Segmentation | Beats SAM with less data, supports diverse inputs (text, box, etc.)
Video Segmentation | Competitive zero-shot performance on DAVIS17/16, no architectural modification needed

📝 Conclusion

So in the end!! SEEM can use text and multiple image inputs as prompts,
and just like ChatGPT keeps a conversation going, SEEM can continue segmenting on top of the previous prompts!!
The paper shows a variety of result images for this, and they are really interesting!!

On top of segmentation, it even gives the class meaning. Impressive!!
(Figure: segmentation results with semantic labels)

Below, it is also striking that, like SAM2, it segments while tracking the frames of a video!
(Figure: zero-shot video object segmentation results)

  • We will soon explore this through hands-on practice as well!!

This post is licensed under CC BY 4.0 by the author.