🧩 SAMEO: Segment occluded objects in one shot!!

Image

  • Title: Segment Anything, Even Occluded (SAMEO)
  • Conference: CVPR 2025
  • Project/Demo: Project Page · CVF OpenAccess PDF
  • Keywords: Amodal Instance Segmentation, Segment Anything, EfficientSAM, Detector+Mask Decoupling, Amodal-LVIS
  • Summary: To segment even occluded regions, SAMEO first takes bounding boxes from a SOTA object detector, then uses a SAM (EfficientSAM) mask decoder to recover the full object: both the visible region inside the box and the occluded parts!

🧠 Key Contributions

  1. SAMEO Framework
    Decomposes amodal segmentation into (1) object detection + (2) mask reconstruction and uses SAM (EfficientSAM) as a plug-in mask decoder to recover occluded shapes. The detector is swappable and can be paired with various backbones.

  2. Amodal-LVIS: Large-Scale Synthetic Dataset (≈300K images)
    Introduces Amodal-LVIS, synthesized from LVIS/LVVIS with amodal annotations, alleviating the training data bottleneck for amodal segmentation.

  3. Zero-shot Generalization
    Shows strong zero-shot performance on benchmarks like COCOA-cls and D2SA!!

  4. Practical Utility
    Compatible with existing modal detectors (open-/closed-set) and applicable to segmentation + labeling pipelines like SAM-based annotation tools.


๐Ÿ” Background

  • Amodal segmentation aims to segment both visible (modal) and occluded regions, reconstructing the full object.

  • Many instance segmentation methods jointly train detection and segmentation, which reduces flexibility and faces limited large-scale training data.

  • Segment Anything is a foundation model that segments "anything" well; EfficientSAM improves practicality with a lighter design.

  • Existing amodal datasets include COCOA / D2SA / COCOA-cls, as well as KINS, DYCE, MUVA, MP3D-Amodal, WALT, and KITTI-360-APS, but each has drawbacks:

    • DYCE / MP3D-Amodal (synthetic indoor, 3D mesh-based): Architectural elements (walls/floors/ceilings) dominate the frame → inefficient training signal; many samples where the visible part is extremely small, weakening supervision.

    • WALT (time-lapse / traffic synthesis): Layered compositing can cause unnatural occlusions and distorted depth/occlusion relationships.

    • COCOA and similar datasets with class annotations: Many stuff (background) classes → labels not aligned with amodal instance segmentation, adding noise instead of object-centric learning.


📘 SAMEO Architecture!!

Image

  • Front-end Detector: Your existing (or preferred) detector predicts and passes BBoxes.
  • Back-end SAMEO (Mask Decoder): Given the boxes, segments in the EfficientSAM style; the image encoder and prompt encoder stay frozen and only the mask decoder is fine-tuned (see the sketch after this list).
    • Input: Original Image + BBox (from detector)
    • Training: Use modal and amodal boxes at a 50:50 ratio!!
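
To make the decoupling concrete, here is a minimal PyTorch sketch. The module and argument names (`image_encoder`, `prompt_encoder`, `mask_decoder`) are placeholders, not the actual EfficientSAM API; only the freeze-everything-but-the-decoder pattern and the 50:50 box-prompt sampling come from the paper.

```python
import random
import torch
import torch.nn as nn

class SAMEO(nn.Module):
    """Detector-agnostic amodal mask decoder (sketch, not the official code)."""

    def __init__(self, image_encoder: nn.Module, prompt_encoder: nn.Module,
                 mask_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        # Freeze the image and prompt encoders; only the mask decoder trains.
        for module in (self.image_encoder, self.prompt_encoder):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, image: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        feats = self.image_encoder(image)        # frozen image features
        prompt = self.prompt_encoder(boxes)      # frozen box-prompt embedding
        return self.mask_decoder(feats, prompt)  # trainable amodal prediction

def pick_box_prompt(modal_box: torch.Tensor, amodal_box: torch.Tensor) -> torch.Tensor:
    # Training prompts mix modal and amodal boxes at a 50:50 ratio.
    return modal_box if random.random() < 0.5 else amodal_box
```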

🔧 Training Strategy: Loss Composition

Image

0) Summary

  • Dice → maximize overlap
  • Focal → focus on hard pixels
  • IoU L1 → quality score calibration (learn reliability)

1) Dice Loss (Eq. 3): Overlap/Boundary-focused

  • Goal: Maximize overlap between the predicted mask M̂ and the ground-truth mask M_gt
  • Definition:
    \[ \mathcal{L}_{\text{Dice}} = 1 - \frac{2\,|\hat{M} \cap M_{gt}|}{|\hat{M}| + |M_{gt}|} \]
    • Numerator: intersection (overlapping pixels)
    • Denominator: sum of pixels in both masks
  • Note: Stable under class imbalance (small objects); improves boundary quality.
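
For concreteness, a minimal PyTorch sketch of the soft Dice term above; the function name and the `eps` smoothing are my own additions:

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss; `pred` holds probabilities in [0, 1], `target` is binary (B, H, W)."""
    inter = (pred * target).flatten(1).sum(dim=1)                       # |M̂ ∩ M_gt| (soft)
    sizes = pred.flatten(1).sum(dim=1) + target.flatten(1).sum(dim=1)   # |M̂| + |M_gt|
    # eps avoids 0/0 on empty masks (my own smoothing, not from the paper).
    return (1.0 - (2.0 * inter + eps) / (sizes + eps)).mean()
```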

2) Focal Loss (Eq. 4): Emphasize hard pixels

  • Goal: Down-weight easy pixels and focus on hard ones
  • Definition:
    \[ \mathcal{L}_{\text{Focal}} = -(1 - p_t)^{\gamma}\,\log(p_t), \quad \gamma = 2 \]
    • (p_t): predicted probability of the target class (FG/BG)
    • Larger (\gamma) → stronger suppression of easy samples, more focus on hard samples
  • Note: Helps on fine/occluded regions.
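
A matching sketch of the focal term, again with assumed names and a clamp added for numerical safety:

```python
import torch

def focal_loss(pred: torch.Tensor, target: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Pixel-wise focal loss; `pred` holds probabilities, `target` is binary."""
    p_t = torch.where(target > 0.5, pred, 1.0 - pred)  # probability of the true class
    # (1 - p_t)^gamma down-weights pixels the model already classifies easily.
    return (-((1.0 - p_t) ** gamma) * torch.log(p_t.clamp_min(1e-6))).mean()
```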

3) IoU Prediction L1 Loss (λ=0.05): Score Calibration

  • Goal: Make the decoder's predicted IoU (\hat{\rho}) close to the true IoU
  • Use: Enables confidence refinement and reliable ranking among candidate masks.
  • Weight: Use a small coefficient λ = 0.05 in the total loss.
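
Putting the three terms together (reusing the two helpers above); only λ = 0.05 for the IoU term is stated, and giving Dice and Focal equal weight is my own reading:

```python
import torch

def sameo_mask_loss(pred: torch.Tensor, target: torch.Tensor,
                    iou_pred: torch.Tensor, lam: float = 0.05) -> torch.Tensor:
    """Dice + Focal + lam * L1(predicted IoU, actual IoU)."""
    hard = (pred > 0.5).float()
    inter = (hard * target).flatten(1).sum(dim=1)
    union = ((hard + target) > 0).float().flatten(1).sum(dim=1)
    actual_iou = inter / union.clamp_min(1.0)
    calib = (iou_pred - actual_iou).abs().mean()  # L1 on the quality estimate
    return dice_loss(pred, target) + focal_loss(pred, target) + lam * calib
```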

📚 Amodal-LVIS Dataset!!

Image

  • In addition to the amodal segmentation model, the work also presents a training dataset!
  • It's a synthetic dataset created through a 3-stage pipeline.
  • Overall training collection: ~1M images / ~2M annotations (of which Amodal-LVIS contributes the ≈300K synthetic images above).

🔄 Generation Pipeline

1) Complete Object Collection

  • Use SAMEO to generate pseudo amodal masks for LVIS/LVVIS instances.
  • Compare predicted amodal masks with GT modal masks to select fully visible (unoccluded) objects.

    Outcome: a pool of complete objects.
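
A sketch of the selection rule, assuming it is an IoU test between the predicted amodal mask and the GT modal mask; the exact criterion and threshold in the paper may differ:

```python
import numpy as np

def is_fully_visible(pred_amodal: np.ndarray, gt_modal: np.ndarray,
                     thr: float = 0.95) -> bool:
    """True if the predicted amodal mask barely exceeds the GT modal mask."""
    inter = np.logical_and(pred_amodal, gt_modal).sum()
    union = np.logical_or(pred_amodal, gt_modal).sum()
    # High IoU means SAMEO found no hidden extent: the object is unoccluded.
    return bool(inter / max(union, 1) >= thr)
```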

2) Synthetic Occlusion Generation

  • Randomly pair objects from the pool and compose them into the same scene.
  • Preserve aspect ratios with size normalization for natural scale.
  • Use bounding boxes to control relative positions/occlusion ratios → enables occlusion curriculum.
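
A minimal compositing sketch under stated assumptions: objects are RGBA cutouts, placement is box-driven, and occlusion difficulty is measured from the masks (helper names are mine):

```python
import numpy as np

def paste_occluder(scene: np.ndarray, cutout: np.ndarray, top: int, left: int) -> np.ndarray:
    """Alpha-blend an RGBA object cutout onto the scene at (top, left).
    Assumes the cutout fits entirely within the scene bounds."""
    h, w = cutout.shape[:2]
    alpha = cutout[..., 3:4].astype(np.float32) / 255.0
    out = scene.copy()
    region = out[top:top + h, left:left + w].astype(np.float32)
    out[top:top + h, left:left + w] = (
        alpha * cutout[..., :3] + (1.0 - alpha) * region
    ).astype(scene.dtype)
    return out

def occlusion_ratio(target_amodal: np.ndarray, occluder_mask: np.ndarray) -> float:
    # Fraction of the target hidden by the occluder; used to steer difficulty.
    hidden = np.logical_and(target_amodal, occluder_mask).sum()
    return float(hidden) / max(int(target_amodal.sum()), 1)
```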

3) Dual Annotation Mechanism: provide both unoccluded and occluded versions!

  • Training only on occluded cases leads to over-occlusion predictions.
  • For each instance, provide:
    • Original (unoccluded) image/mask
    • Synthesized (occluded) image/mask
      → Reduces bias and improves generalization.
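
As a toy illustration, each complete object could yield a record pair like this (file and field names are hypothetical):

```python
# The amodal mask is the full object shape, so it is shared by both versions.
dual_records = [
    {"image": "obj_0001_original.jpg", "amodal_mask": "obj_0001_mask.png", "occluded": False},
    {"image": "obj_0001_occluded.jpg", "amodal_mask": "obj_0001_mask.png", "occluded": True},
]
```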

🧪 Ablation & Results

  • Ablation 1: With bbox prompts, compare amodal-only, modal-only, and 50:50 mixed. The mixed setup performs best overall!

  • Ablation 2: Training only on occluded images leads to incorrect segmentation even when the target object is clearly indicated by the bbox!

  • Results?

Image

1) Quantitative

  • Evaluate train→test on COCOA-cls, D2SA, and MUVA.
  • Regardless of the front-end type, attaching SAMEO yields AP/AR gains over AISFormer.
  • Whether the front-end outputs modal or amodal masks, SAMEO refines them into strong amodal performance (prompt-type agnostic).

2) Qualitative

  • In challenging cases (complex overlaps of bottles/containers, heavy occlusion such as people behind barriers, diverse categories and poses), SAMEO delivers
    • Sharper amodal boundaries,
    • More reasonable occlusion inference than the baseline (AISFormer).

3) Zero-shot

  • Training: Our collection + Amodal-LVIS (excluding COCOA-cls/D2SA), with log-proportional dataset sampling per batch (sketched after this list).
  • Evaluation: Zero-shot on the two held-out datasets with various front-ends.
  • Results: +13.8 AP on COCOA-cls (with RTMDet) and +8.7 AP on D2SA (with CO-DETR) → SOTA; EfficientSAM is successfully adapted to amodal segmentation while preserving zero-shot generalization.
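
"Log-proportional sampling" plausibly means drawing each dataset with probability proportional to the log of its size; a sketch under that assumption (the sizes below are placeholders, not the paper's statistics):

```python
import math
import random

def log_proportional_weights(sizes: dict) -> dict:
    """Sampling weights proportional to log(dataset size)."""
    logs = {name: math.log(n) for name, n in sizes.items()}
    total = sum(logs.values())
    return {name: v / total for name, v in logs.items()}

# Placeholder sizes; weight by log(size) so huge sets don't drown out small ones.
weights = log_proportional_weights({"Amodal-LVIS": 300_000, "MUVA": 26_000, "KINS": 7_000})
chosen = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
```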

🧩 Conclusion

  • A SOTA plug-in that works with various object detectors to segment both visible and occluded regions within the bbox!
  • And they release the dataset as well. Thanks!!
This post is licensed under CC BY 4.0 by the author.