🧩 SAMEO: Segment occluded objects in one shot!!

Image

  • Title: Segment Anything, Even Occluded (SAMEO)
  • Conference: CVPR 2025
  • Project/Demo: Project Page · CVF OpenAccess PDF
  • Keywords: Amodal Instance Segmentation, Segment Anything, EfficientSAM, Detector+Mask Decoupling, Amodal-LVIS
  • Summary: To segment even occluded regions, SAMEO first takes bounding boxes from a SOTA object detector, then uses a SAM (EfficientSAM) mask decoder to recover the full object: both the visible region inside the box and the occluded parts!

🧠 Key Contributions

  1. SAMEO Framework
    Decomposes amodal segmentation into (1) object detection + (2) mask reconstruction and uses SAM (EfficientSAM) as a plug-in mask decoder to recover occluded shapes. The detector is swappable and can be paired with various backbones.

  2. Amodal-LVIS: Large-Scale Synthetic Dataset (≈300K images)
    Introduces Amodal-LVIS, synthesized from LVIS/LVVIS with amodal annotations, alleviating the training data bottleneck for amodal segmentation.

  3. Zero-shot Generalization
    Shows strong zero-shot performance on benchmarks like COCOA-cls and D2SA!!

  4. Practical Utility
    Compatible with existing modal detectors (open-/closed-set) and applicable to segmentation + labeling pipelines like SAM-based annotation tools.


๐Ÿ” Background

  • Amodal segmentation aims to segment both visible (modal) and occluded regions, reconstructing the full object.

  • Many instance segmentation methods jointly train detection and segmentation, which reduces flexibility and faces limited large-scale training data.

  • Segment Anything is a foundation model that segments "anything" well; EfficientSAM improves practicality with a lighter design.

  • Existing amodal datasets include COCOA / D2SA / COCOA-cls, as well as KINS, DYCE, MUVA, MP3D-Amodal, WALT, and KITTI-360-APS, but each has drawbacks:

    • DYCE / MP3D-Amodal (synthetic indoor, 3D mesh-based): Architectural elements (walls/floors/ceilings) dominate the frame → inefficient training signal; many samples where the visible part is extremely small, weakening supervision.

    • WALT (time-lapse / traffic synthesis): Layered compositing can cause unnatural occlusions and distorted depth/occlusion relationships.

    • COCOA and similar datasets with class annotations: Many stuff (background) classes → labels not aligned with amodal instance segmentation, adding noise instead of object-centric learning.


📘 SAMEO Architecture!!

Image

  • Front-end Detector: Your existing (or preferred) detector predicts and passes BBoxes.
  • Back-end SAMEO (Mask Decoder): Given the boxes, segments in the EfficientSAM style; the image encoder and prompt encoder stay frozen and only the mask decoder is fine-tuned (see the sketch after this list).
    • Input: Original Image + BBox (from detector)
    • Training: Use modal and amodal boxes at a 50:50 ratio!!
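
To make the decoupling concrete, here is a minimal PyTorch sketch. The module and argument names (`image_encoder`, `prompt_encoder`, `mask_decoder`) are placeholders, not the actual EfficientSAM API; only the freeze-everything-but-the-decoder pattern and the 50:50 box-prompt sampling come from the paper.

```python
import random
import torch
import torch.nn as nn

class SAMEO(nn.Module):
    """Detector-agnostic amodal mask decoder (sketch, not the official code)."""

    def __init__(self, image_encoder: nn.Module, prompt_encoder: nn.Module,
                 mask_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        # Freeze the image and prompt encoders; only the mask decoder trains.
        for module in (self.image_encoder, self.prompt_encoder):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, image: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        feats = self.image_encoder(image)        # frozen image features
        prompt = self.prompt_encoder(boxes)      # frozen box-prompt embedding
        return self.mask_decoder(feats, prompt)  # trainable amodal prediction

def pick_box_prompt(modal_box: torch.Tensor, amodal_box: torch.Tensor) -> torch.Tensor:
    # Training prompts mix modal and amodal boxes at a 50:50 ratio.
    return modal_box if random.random() < 0.5 else amodal_box
```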

🔧 Training Strategy: Loss Composition

Image

0) Summary

  • Dice → maximize overlap
  • Focal → focus on hard pixels
  • IoU L1 → quality score calibration (learn reliability)

1) Dice Loss (Eq. 3): Overlap/Boundary-focused

  • Goal: Maximize overlap between the predicted mask M̂ and the ground-truth mask M_gt
  • Definition:
    \[ \mathcal{L}_{\text{Dice}} = 1 - \frac{2\,|\hat{M} \cap M_{gt}|}{|\hat{M}| + |M_{gt}|} \]
    • Numerator: intersection (overlapping pixels)
    • Denominator: sum of pixels in both masks
  • Note: Stable under class imbalance (small objects); improves boundary quality.
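
For concreteness, a minimal PyTorch sketch of the soft Dice term above; the function name and the `eps` smoothing are my own additions:

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss; `pred` holds probabilities in [0, 1], `target` is binary (B, H, W)."""
    inter = (pred * target).flatten(1).sum(dim=1)                       # |M̂ ∩ M_gt| (soft)
    sizes = pred.flatten(1).sum(dim=1) + target.flatten(1).sum(dim=1)   # |M̂| + |M_gt|
    # eps avoids 0/0 on empty masks (my own smoothing, not from the paper).
    return (1.0 - (2.0 * inter + eps) / (sizes + eps)).mean()
```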

2) Focal Loss (Eq. 4): Emphasize hard pixels

  • Goal: Down-weight easy pixels and focus on hard ones
  • Definition:
    \[ \mathcal{L}_{\text{Focal}} = -(1 - p_t)^{\gamma}\,\log(p_t), \quad \gamma = 2 \]
    • (p_t): predicted probability of the target class (FG/BG)
    • Larger (\gamma) → stronger suppression of easy samples, more focus on hard samples
  • Note: Helps on fine/occluded regions.
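
A matching sketch of the focal term, again with assumed names and a clamp added for numerical safety:

```python
import torch

def focal_loss(pred: torch.Tensor, target: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Pixel-wise focal loss; `pred` holds probabilities, `target` is binary."""
    p_t = torch.where(target > 0.5, pred, 1.0 - pred)  # probability of the true class
    # (1 - p_t)^gamma down-weights pixels the model already classifies easily.
    return (-((1.0 - p_t) ** gamma) * torch.log(p_t.clamp_min(1e-6))).mean()
```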

3) IoU Prediction L1 Loss (λ=0.05): Score Calibration

  • Goal: Make the decoder's predicted IoU (\hat{\rho}) close to the true IoU
  • Use: Enables confidence refinement and reliable ranking among candidate masks.
  • Weight: Use a small coefficient λ = 0.05 in the total loss.
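
Putting the three terms together (reusing the two helpers above); only λ = 0.05 for the IoU term is stated, and giving Dice and Focal equal weight is my own reading:

```python
import torch

def sameo_mask_loss(pred: torch.Tensor, target: torch.Tensor,
                    iou_pred: torch.Tensor, lam: float = 0.05) -> torch.Tensor:
    """Dice + Focal + lam * L1(predicted IoU, actual IoU)."""
    hard = (pred > 0.5).float()
    inter = (hard * target).flatten(1).sum(dim=1)
    union = ((hard + target) > 0).float().flatten(1).sum(dim=1)
    actual_iou = inter / union.clamp_min(1.0)
    calib = (iou_pred - actual_iou).abs().mean()  # L1 on the quality estimate
    return dice_loss(pred, target) + focal_loss(pred, target) + lam * calib
```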

📚 Amodal-LVIS Dataset!!

Image

  • In addition to the amodal segmentation model, the work also presents a training dataset!
  • It's a synthetic dataset created through a 3-stage pipeline.
  • Overall training collection: ~1M images / ~2M annotations (of which Amodal-LVIS contributes the ≈300K synthetic images above).

🔄 Generation Pipeline

1) Complete Object Collection

  • Use SAMEO to generate pseudo amodal masks for LVIS/LVVIS instances.
  • Compare predicted amodal masks with GT modal masks to select fully visible (unoccluded) objects.

    Outcome: a pool of complete objects.
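
A sketch of the selection rule, assuming it is an IoU test between the predicted amodal mask and the GT modal mask; the exact criterion and threshold in the paper may differ:

```python
import numpy as np

def is_fully_visible(pred_amodal: np.ndarray, gt_modal: np.ndarray,
                     thr: float = 0.95) -> bool:
    """True if the predicted amodal mask barely exceeds the GT modal mask."""
    inter = np.logical_and(pred_amodal, gt_modal).sum()
    union = np.logical_or(pred_amodal, gt_modal).sum()
    # High IoU means SAMEO found no hidden extent: the object is unoccluded.
    return bool(inter / max(union, 1) >= thr)
```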

2) Synthetic Occlusion Generation

  • Randomly pair objects from the pool and compose them into the same scene.
  • Preserve aspect ratios with size normalization for natural scale.
  • Use bounding boxes to control relative positions/occlusion ratios → enables occlusion curriculum.
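
A minimal compositing sketch under stated assumptions: objects are RGBA cutouts, placement is box-driven, and occlusion difficulty is measured from the masks (helper names are mine):

```python
import numpy as np

def paste_occluder(scene: np.ndarray, cutout: np.ndarray, top: int, left: int) -> np.ndarray:
    """Alpha-blend an RGBA object cutout onto the scene at (top, left).
    Assumes the cutout fits entirely within the scene bounds."""
    h, w = cutout.shape[:2]
    alpha = cutout[..., 3:4].astype(np.float32) / 255.0
    out = scene.copy()
    region = out[top:top + h, left:left + w].astype(np.float32)
    out[top:top + h, left:left + w] = (
        alpha * cutout[..., :3] + (1.0 - alpha) * region
    ).astype(scene.dtype)
    return out

def occlusion_ratio(target_amodal: np.ndarray, occluder_mask: np.ndarray) -> float:
    # Fraction of the target hidden by the occluder; used to steer difficulty.
    hidden = np.logical_and(target_amodal, occluder_mask).sum()
    return float(hidden) / max(int(target_amodal.sum()), 1)
```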

3) Dual Annotation Mechanism: provide both unoccluded and occluded versions!

  • Training only on occluded cases leads to over-occlusion predictions.
  • For each instance, provide:
    • Original (unoccluded) image/mask
    • Synthesized (occluded) image/mask
      → Reduces bias and improves generalization.
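
As a toy illustration, each complete object could yield a record pair like this (file and field names are hypothetical):

```python
# The amodal mask is the full object shape, so it is shared by both versions.
dual_records = [
    {"image": "obj_0001_original.jpg", "amodal_mask": "obj_0001_mask.png", "occluded": False},
    {"image": "obj_0001_occluded.jpg", "amodal_mask": "obj_0001_mask.png", "occluded": True},
]
```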

🧪 Ablation & Results

  • Ablation 1: With bbox prompts, compare amodal-only, modal-only, and 50:50 mixed. The mixed setup performs best overall!

  • Ablation 2: Training only on occluded images leads to incorrect segmentation even when the target object is clearly indicated by the bbox!

  • Results?

Image

1) Quantitative

  • Evaluate train→test on COCOA-cls, D2SA, and MUVA.
  • Regardless of the front-end type, attaching SAMEO yields AP/AR gains over AISFormer.
  • Whether the front-end outputs modal or amodal masks, SAMEO refines them into strong amodal performance (prompt-type agnostic).

2) Qualitative

  • In challenging cases (complex overlaps of bottles/containers, heavy occlusion such as people behind barriers, diverse categories and poses), SAMEO delivers
    • Sharper amodal boundaries,
    • More reasonable occlusion inference than the baseline (AISFormer).

3) Zero-shot

  • Training: Our collection + Amodal-LVIS (excluding COCOA-cls/D2SA), with log-proportional dataset sampling per batch (sketched after this list).
  • Evaluation: Zero-shot on the two held-out datasets with various front-ends.
  • Results: +13.8 AP on COCOA-cls (with RTMDet) and +8.7 AP on D2SA (with CO-DETR) → SOTA; EfficientSAM is successfully adapted to amodal segmentation while preserving zero-shot generalization.
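
"Log-proportional sampling" plausibly means drawing each dataset with probability proportional to the log of its size; a sketch under that assumption (the sizes below are placeholders, not the paper's statistics):

```python
import math
import random

def log_proportional_weights(sizes: dict) -> dict:
    """Sampling weights proportional to log(dataset size)."""
    logs = {name: math.log(n) for name, n in sizes.items()}
    total = sum(logs.values())
    return {name: v / total for name, v in logs.items()}

# Placeholder sizes; weight by log(size) so huge sets don't drown out small ones.
weights = log_proportional_weights({"Amodal-LVIS": 300_000, "MUVA": 26_000, "KINS": 7_000})
chosen = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
```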

🧩 Conclusion

  • A SOTA plug-in that works with various object detectors to segment both visible and occluded regions within the bbox!
  • And they release the dataset as well. Thanks!!
This post is licensed under CC BY 4.0 by the author.