🔎 Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

Posted Sep 1, 2025

By DrFirst

14 min read

🔎 Open-Vocabulary SAM: Expanding to Recognize & Segment 20,000 Classes!

Title: Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
Conference: ECCV 2024
Code/Checkpoints: GitHub – OVSAM
Keywords: Segment Anything, Open-Vocabulary, CLIP, Recognition, Promptable Segmentation, CLIP2SAM, SAM2CLIP
Summary: Extending SAM’s segmentation capability with open-vocabulary recognition → scalability to 20,000+ classes!

🚀 Key Highlights of Open-Vocabulary SAM

One-liner: “SAM doesn’t just cut objects anymore, it also names them!”

1) Basic Structure: CLIP2SAM & SAM2CLIP

(Segmentation) Encode images with CLIP → align to SAM via CLIP2SAM → segmentation results from SAM Decoder
(Recognition) Segmentation results → SAM2CLIP → object names retrieved

2) Open-Vocabulary 🎯

Recognizes 20,000+ classes without pre-defined labels.
Given text prompts (e.g., “cat”, “chair”), each segmented mask is matched to the best-fitting label.

3) Interactive Extensibility 🛠️

Retains SAM’s point/box/everything prompts.
Users can provide prompts (words/sentences) for real-time recognition + segmentation.

4) General Vision Pipeline ⚡

Evolves SAM from a pure segmentation tool into an open segmentation system with recognition.
In research and industry settings, a single click returns both “what” and “where”.

🔍 Related Work

VLMs have advanced! CLIP in particular, with contrastive vision-language pre-training, shows strong zero-shot recognition.
Many open-vocabulary detection and segmentation studies build on CLIP.
Prompting has evolved: it started in NLP and moved into vision as point/bbox prompts.
SAM: a large-scale segmentation model, widely applied to tracking, generation, etc.
This work fuses CLIP and SAM!

🧱 Open-Vocabulary SAM Architecture

Baseline (a): Image Cropping Baseline

Cut the image according to the SAM mask → input to CLIP → retrieve label
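
To make the image-cropping baseline concrete, here is a minimal sketch of the crop step only. It uses plain Python lists and a single-channel image to stay dependency-free. The names `mask_bbox` and `crop_by_mask` are hypothetical, and blanking out background pixels is an assumption about one reasonable way to crop, not something the post specifies. The resulting crop is what would then be handed to CLIP.

```python
from typing import List, Tuple

Mask = List[List[int]]      # binary mask, 1 = object pixel
Image = List[List[float]]   # single-channel image, for brevity


def mask_bbox(mask: Mask) -> Tuple[int, int, int, int]:
    """Return the inclusive (top, left, bottom, right) box around the mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], cols[0], rows[-1], cols[-1]


def crop_by_mask(image: Image, mask: Mask, fill: float = 0.0) -> Image:
    """Crop to the mask's bounding box; background pixels become `fill` (assumption)."""
    top, left, bottom, right = mask_bbox(mask)
    return [
        [image[r][c] if mask[r][c] else fill for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]

crop = crop_by_mask(image, mask)
print(crop)  # [[6.0, 7.0], [0.0, 11.0]]
```

Because every mask needs its own crop and its own CLIP forward pass, the cost of this baseline grows with the number of masks.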

Baseline (b): Feature Cropping Baseline

From CLIP embeddings, crop the region corresponding to the SAM mask → retrieve label
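
The feature-cropping baseline can be sketched as mask pooling: average the feature vectors at the locations the mask covers. The sketch below assumes the mask has already been resized to the feature map's resolution and that averaging is the pooling operator. `mask_pool` is a hypothetical name, and plain lists stand in for tensors.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # H x W x D
Mask = List[List[int]]                # H x W, 1 = inside the mask


def mask_pool(features: FeatureMap, mask: Mask) -> List[float]:
    """Average the D-dimensional feature vectors at positions where mask == 1."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for d in range(dim):
                    total[d] += vec[d]
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]


features = [[[1.0, 0.0], [3.0, 2.0]],
            [[5.0, 4.0], [7.0, 6.0]]]
mask = [[0, 1],
        [1, 0]]

pooled = mask_pool(features, mask)
print(pooled)  # [4.0, 3.0]
```

The pooled vector would then be compared against text embeddings to pick a label. Unlike image cropping, the image is encoded only once.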

Problems of these baselines:
1) Using two separate backbones → high computational cost
2) Different training paradigms (SAM: supervised, CLIP: contrastive) → unstable knowledge transfer
3) Even with adapters, small object recognition remains weak
4) No prior exploration of how to fuse SAM’s dense visual features with CLIP’s semantic features for open-vocabulary segmentation

OVSAM: Unified Architecture

1) Two backbones = costly
Solution: Use the CLIP encoder only; keep SAM’s prompt encoder + decoder
2) Different training (SAM vs CLIP) = unstable knowledge transfer
Solution: Introduce SAM2CLIP to bridge the feature spaces
3) Small object issue
Solution: Enhance CLIP2SAM with FPN + R-CNN-like MLP
4) No SAM–CLIP integration for open-vocab
Solution: This work proposes a unified architecture with open-vocabulary capabilities!

Image Encoder (CLIP + CLIP2SAM)
- Use CLIP’s visual encoder, then project features through CLIP2SAM for alignment with SAM Decoder
Prompt Encoder (SAM)
- Same as original SAM: handles point/box/mask prompts
Mask Decoder (SAM)
- Combines CLIP2SAM features + prompts → outputs segmentation masks
Recognition Head (SAM2CLIP)
- Project masks into CLIP embedding space
- Match with text embeddings via cosine similarity
- → Final output = segmentation + labeling
CLIP2SAM = “Recognition → Segmentation” bridge
SAM2CLIP = “Segmentation → Recognition” bridge
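
The cosine-similarity matching step in the recognition head can be illustrated with a toy example. This is a sketch under stated assumptions: the 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, and `cosine` and `classify` are hypothetical helper names, not functions from the OVSAM codebase.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query: List[float],
             text_embeddings: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the class name whose text embedding is most similar to the query."""
    scores = {name: cosine(query, emb) for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy text embeddings (assumption: real ones come from CLIP's text encoder).
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]  # toy embedding for one segmented mask

label, score = classify(query, text_embeddings)
print(label)  # cat
```

Extending the vocabulary only means adding entries to `text_embeddings`; the matching code does not change, which is what makes the label set open.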

🔧 Training Recipe

Step 1: SAM2CLIP training with SA-1B (1%) (distillation loss)
- Extract feature (F_{sam}) with SAM encoder
- Extract feature (E_I) with CLIP visual encoder
- Adapter (Transformer layers) aligns CLIP features to SAM features (distillation)

\[L_{distill} = \mathrm{MSE}\!\left(F_{sam}, A_{sam2clip}\!\left(\mathrm{Fusion}\!\left(E_I^i\right)\right)\right)\]
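
The distillation loss above can be worked through on toy numbers. This is a minimal sketch, not the training code. `fuse` stands in for Fusion and is assumed here to be an element-wise mean over per-scale features that already share a shape. `toy_adapter` stands in for the Transformer adapter A_{sam2clip}. All values are made up.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fuse(multi_scale: Sequence[Vector]) -> Vector:
    """Stand-in for Fusion(E_I^i): element-wise mean over scales (assumption)."""
    n = len(multi_scale)
    return [sum(vals) / n for vals in zip(*multi_scale)]


def mse(target: Vector, pred: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(target) != len(pred):
        raise ValueError("shape mismatch")
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)


def distill_loss(f_sam: Vector,
                 clip_feats: Sequence[Vector],
                 adapter: Callable[[Vector], Vector]) -> float:
    """L_distill = MSE(F_sam, A_sam2clip(Fusion(E_I^i)))."""
    return mse(f_sam, adapter(fuse(clip_feats)))


def toy_adapter(x: Vector) -> Vector:
    """Toy affine map standing in for the learnable adapter."""
    return [2.0 * v + 1.0 for v in x]


f_sam = [3.0, 5.0, 7.0]                            # toy SAM encoder feature
clip_feats = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]    # toy CLIP features at two scales

loss = distill_loss(f_sam, clip_feats, toy_adapter)
print(loss)  # 0.0
```

The toy adapter reproduces `f_sam` exactly, so the loss is zero. With an identity adapter the same inputs give 29/3, and training would move the adapter's parameters to shrink that gap.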

Step 2: Joint training of CLIP2SAM + Mask Decoder with COCO/LVIS
- CLIP2SAM: transforms CLIP semantic features into SAM-compatible region features
- Pipeline:
  1. Image → CLIP encoder (frozen) → multi-scale features
  2. Prompt (point/box) → Prompt Encoder (SAM)
  3. CLIP2SAM(+FPN): multi-scale CLIP features → SAM-compatible region features
  4. Mask Decoder (SAM): predicts mask/IoU
  5. Recognition Head: Q_label vs CLIP text embedding → label score
Additional: Joint training with ImageNet → expansion to 22K classes

🧪 Experimental Results

🎯 Open-Vocabulary Segmentation

COCO (IoU_b=81.5 / IoU_n=84.0), LVIS (IoU_b=83.1 / IoU_n=83.6)
- Balanced performance across base/novel classes, outperforming baselines
FLOPs (1,180G) and parameters (304M) are significantly lower than the baselines → efficiency + accuracy
Under the * setting (the mask center point used as the prompt), the baselines collapse in performance
- Image-Crop baseline*: COCO IoU_n=26.4, LVIS IoU_n=2.3
- OVSAM*: IoU_b=63.6, IoU_n=67.9 → robust even with weak prompts
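
To clarify what these numbers measure, the sketch below computes mask IoU, averages it separately over base and novel classes (IoU_b / IoU_n), and derives a center-point prompt from a mask. It rests on assumptions: the function names are hypothetical, the center is taken as the foreground centroid (the paper's exact definition is not given in this post), and the example values are made up.

```python
from typing import Dict, List, Set, Tuple

Mask = List[List[int]]  # binary mask, 1 = object pixel


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal size."""
    inter = sum(1 for rp, rg in zip(pred, gt) for p, g in zip(rp, rg) if p and g)
    union = sum(1 for rp, rg in zip(pred, gt) for p, g in zip(rp, rg) if p or g)
    return inter / union if union else 0.0


def mean_iou_by_split(results: List[Tuple[str, float]],
                      novel_classes: Set[str]) -> Dict[str, float]:
    """Average per-instance IoU separately for base and novel classes."""
    buckets: Dict[str, List[float]] = {"base": [], "novel": []}
    for name, iou in results:
        buckets["novel" if name in novel_classes else "base"].append(iou)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}


def mask_center(mask: Mask) -> Tuple[int, int]:
    """(row, col) centroid of foreground pixels, rounded with round().
    Note: for concave masks the centroid can fall outside the object."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask is empty")
    return (round(sum(p[0] for p in pts) / len(pts)),
            round(sum(p[1] for p in pts) / len(pts)))


pred = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
gt = [[1, 1, 1], [1, 1, 1], [0, 0, 0]]
print(mask_iou(pred, gt))  # intersection 4, union 6

center_mask = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
print(mask_center(center_mask))  # (1, 1)

split = mean_iou_by_split([("cat", 0.75), ("dog", 0.5), ("zebra", 1.0)], {"zebra"})
print(split)  # {'base': 0.625, 'novel': 1.0}
```

A single center point says far less about object extent than a box does, which is one plausible reason the * setting is harder for every method.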

🎯 Segmentation

Mask quality nearly matches SAM-Huge while using ~half the parameters!

With bbox prompts from an OV-detector, OVSAM achieves strong labeling performance compared to other segmentation models.

👀 Qualitative Comparisons

Works well with both box and point prompts
Everything mode: auto-labels dozens of masks (e.g., “cat”, “dog”, “sofa”)
Useful for interactive tools, robotics/AR, accessibility technologies

🧪 Ablation Studies

The precision of the Recognition Head’s text embeddings is crucial → CLIP-based training is the most stable
A joint IoU + text-similarity loss improves mask–text alignment
Scaling from 1K → 20K classes increases runtime only linearly → real-time inference remains feasible

✅ Conclusion

Open-Vocabulary SAM = SAM’s “Segment Anything” + CLIP’s “Recognize Anything”
Enables 20,000+ class zero-shot recognition with full prompt compatibility
Ready for practical deployment: outputs “what + where” instantly
OVSAM evolves SAM from a mask generator into a vision system that also names what it segments: a new standard for segmentation + recognition!

🔎 (Translated from Korean) Open-Vocabulary SAM: Expanding Recognition and Segmentation to 20,000 Classes!

Title: Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
Conference: ECCV 2024
Code/Checkpoints: GitHub – OVSAM
Keywords: Segment Anything, Open-Vocabulary, CLIP, Recognition, Promptable Segmentation, CLIP2SAM, SAM2CLIP
Summary: Combining SAM’s segmentation ability with open-vocabulary recognition → scalability on the order of 20,000 classes!

🚀 Open-Vocabulary SAM Key Summary

One-line summary: “It doesn’t stop at cutting out objects with SAM; it names them too!”

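1) The Structure at a Glance: CLIP2SAM & SAM2CLIP

(Segmentation) Encode the image with CLIP → align it to SAM via CLIP2SAM → extract the segmentation result from the SAM Decoder
(Recognition) That segmentation result → SAM2CLIP → extract the object name

2) Open-Vocabulary 🎯

Recognizes 20,000+ classes without pre-defined labels.
Given a text prompt (e.g., “cat”, “chair”), the segmented mask is matched against it so the object is “recognized”.

3) Interactive Extensibility 🛠️

Retains the original SAM’s point/box/Everything prompts.
Can perform real-time recognition + segmentation with user-specified prompts (words or sentences).

4) General-Purpose Vision Pipeline ⚡

Evolves from a simple segmentation tool into an open segmentation system capable of recognition.
In research and industry settings, a single click yields both “what it is” and “where it is”.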

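🔍 Flow of Prior Research

VLMs have advanced! CLIP in particular, with contrastive vision-language pre-training, is strong at zero-shot recognition.
There is a lot of open-vocabulary research: building on CLIP, both object detection and segmentation have been studied in the open-vocabulary setting.
Prompting has advanced: prompts started in NLP and were then applied to vision, giving rise to point and bbox prompts.
Segmentation and SAM: a segmentation model built on very large-scale data and a very large model; its strong capability has led to use in many areas such as tracking and image generation.
This work fuses a VLM (CLIP) with SAM!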

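🧱 Open-Vocabulary SAM Architecture

Baseline (a): Image Cropping Baseline

Crop the image according to the mask produced by SAM, feed the crop to CLIP, and retrieve the label

Baseline (b): Feature Cropping Baseline

From the CLIP embedding output, crop only the region covered by the SAM mask and retrieve the label

These baselines have several problems!
1) They use two separate backbones, so the computational cost is high!
2) SAM and CLIP are trained differently (SAM: supervised, CLIP: contrastive), which makes knowledge transfer unstable
3) Even when the two are combined with adapters, a large gap remains in small-object recognition
4) There had been no attempt to integrate SAM and CLIP with open-vocabulary capability, e.g., no study of how to combine SAM’s dense visual features with CLIP’s semantic features!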

OVSAM: Unified Architecture

1) Two separate backbones are used, so the computational cost is high!
Solution: Use the CLIP encoder! Keep SAM’s prompt encoder and decoder!
2) SAM and CLIP are trained differently (SAM: supervised, CLIP: contrastive), which makes knowledge transfer unstable
Solution: Connect the SAM and CLIP features with SAM2CLIP
3) Even when combined with adapters, a large gap remains in small-object recognition
Solution: Add an FPN and an R-CNN-like MLP to CLIP2SAM
4) There had been no attempt to integrate SAM and CLIP with open-vocabulary capability, e.g., no study of how to combine SAM’s dense visual features with CLIP’s semantic features!
Solution: This work integrates the two and solves it with open vocabulary!

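Image Encoder (CLIP + CLIP2SAM)
- Uses CLIP’s vision encoder; CLIP features pass through the CLIP2SAM projection to align with the SAM Decoder
Prompt Encoder (SAM)
- Same as the original SAM; processes point/box/mask prompt inputs
Mask Decoder (SAM)
- Combines CLIP2SAM features + prompts → generates the segmentation mask
Recognition Head (SAM2CLIP)
- Projects the masks obtained from SAM into the CLIP embedding space
- Matches them against text prompt embeddings by cosine similarity
- → As a result, object segmentation + labeling are performed together
CLIP2SAM serves as the “recognition → segmentation” link, and SAM2CLIP as the “segmentation → recognition” link!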

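🔧 Training Recipe

Step 1: Train SAM2CLIP on SA-1B (1%) (knowledge transfer, distillation loss)
- Feed the image to the SAM encoder to extract feature (F_{sam})
- Feed the image to the CLIP vision encoder to extract feature (E_I)
- An adapter made of Transformer layers fits the CLIP features to the SAM features, preserving segmentation performance (knowledge distillation)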

\[L_{distill} = \mathrm{MSE}\!\left(F_{sam}, A_{sam2clip}\!\left(\mathrm{Fusion}\!\left(E_I^i\right)\right)\right)\]

Step 2: Jointly train CLIP2SAM + Mask Decoder on COCO/LVIS data (using segmentation losses)
- CLIP2SAM: converts CLIP’s semantic features into a form the SAM decoder can readily use
  1. Image → CLIP encoder (frozen) → extract multi-scale features
  2. Prompt (point/box) input → Prompt Encoder (SAM)
  3. CLIP2SAM(+FPN): converts multi-scale CLIP features → SAM-compatible region features
  4. Mask Decoder (SAM): takes the prompt embedding + converted CLIP features as input and predicts mask/IoU
  5. Recognition Head: similarity between Q_label and CLIP text embeddings → label score
Additional: Joint training with ImageNet as well → extends classification to 22K classes

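🧪 Experimental Results

🎯 Open-Vocabulary Segmentation

COCO (IoU_b=81.5 / IoU_n=84.0), LVIS (IoU_b=83.1 / IoU_n=83.6):
- Balanced performance gains on both base and novel classes over the existing baselines.
FLOPs (1,180G) and parameter count (304M) are also greatly reduced, achieving efficiency and performance together.
Under the * condition (mask center point prompt), every baseline’s performance drops sharply.
- In particular, the Image-Crop baseline* is very poor, at COCO IoU_n 26.4 and LVIS IoU_n 2.3.
In contrast, Open-Vocabulary SAM* remains stable at IoU_b=63.6, IoU_n=67.9.
In other words, the proposed method is shown to stay robust even as the prompt becomes more constrained.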

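🎯 Segmentation

Mask quality was also nearly on par with SAM-H, with half the parameters!

In addition, performance was good both in label classification following a bbox detector and in comparisons with other segmentation models!

👀 Qualitative Comparisons

Segments and labels well with both bbox and point prompts!
In Everything mode, labels can be assigned automatically to the dozens of extracted masks
e.g., automatically distinguishes “cat”, “dog”, and “sofa”
Immediately usable in interactive learning tools, robotics/AR, and accessibility technologies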

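🧪 Ablation Analysis

The precision of the Recognition Head’s text embeddings matters → CLIP-based training is the most stable
For mask–text alignment, the IoU + Text Similarity Joint Loss contributes to the performance gain
Even as the number of classes grows (1K → 20K), inference time increases only linearly → real-time behavior is maintained

✅ Conclusion

Open-Vocabulary SAM adds “Recognize Anything” to the original SAM’s “Segment Anything” ability!
Capable of zero-shot recognition of 20,000+ classes, and ready for immediate real-world deployment thanks to its compatibility with prompt interaction.
Not merely a tool for cutting out masks, but a new standard for SAM, evolved into an AI vision system that names things!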

AI, Research

This post is licensed under CC BY 4.0 by the author.
