
🎭 MaskPrompt: Object Mask Prompts for Open-Vocabulary Affordance Segmentation


  • Title: MaskPrompt: Open-Vocabulary Affordance Segmentation with Object Shape Mask Prompts
  • Venue: AAAI 2025
  • Authors: Dongpan Chen, Dehui Kong, Jinghua Li, Baocai Yin (Beijing Univ. of Tech)
  • Key keywords: Affordance, Segmentation, Open-Vocabulary, Mask Prompt, Vision-Language
  • Summary: MaskPrompt is a new method that uses object-mask-based prompts to accurately segment object affordances (function-level regions) in cluttered scenes and open-vocabulary settings. The authors also build the new OVAS-25 benchmark and improve substantially over the previous SOTA! 🚀

🚀 Research Highlights

One-line summary: “MaskPrompt = object masks + text prompts to solve open-world affordance segmentation!”

1) New task definition (OVAS)

  • Proposes Open-Vocabulary Affordance Segmentation (OVAS)
  • Generalizes to affordances that never appear in the training data

2) The MaskPrompt method

  • Mask Prompt Generation Module (MPGM): generates object masks with DETR + SAM and mask-region captions with Alpha-CLIP
  • Mask Prompt Feature Enhancement Module (MPFEM): removes the background and strengthens per-instance object features
  • Affordance Prediction Module (APM): fuses visual features with text prompts for fine-grained affordance segmentation

3) Benchmark & experimental results

  • Builds the new OVAS-25 dataset (28 object classes, 25 affordances, ~19K images)
  • Also achieves strong performance on existing datasets such as IIT-AFF and UMD

🔍 Limitations of Prior Work and How MaskPrompt Differs

  • Existing affordance segmentation methods:
    • Early work tried attention-based segmentation, but labeled data was scarce.
    • This led to weakly supervised approaches as well.
    • More recently there has also been affordance segmentation based on 3D data.
    • However, these methods rely only on global features → they are vulnerable to interference from the background and from adjacent objects.
  • What about open-vocabulary image segmentation?
    • The goal is to segment categories unseen during training!
    • Existing studies either align image and word embeddings, or exploit the image-text knowledge of VLMs such as CLIP.
    • Other models further boost segmentation performance through techniques such as prompt learning.

🧱 MaskPrompt Architecture

(Figure: overall MaskPrompt architecture)

1) MPGM (mask prompt generation module): generates object masks + mask captions (rough sketch below)
a. Build the Object Shape Mask (M_os): with DETR (bbox detection) + SAM (segmentation)!
b. Generate the Mask Caption (w_mask): feed the original image + the mask into Alpha-CLIP (a CLIP variant that takes an additional mask/alpha-channel input) to produce a caption describing the masked region
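
To make the flow concrete, here is a minimal sketch of MPGM in Python. It is not the authors' code: `detr_detect`, `sam_segment`, and `alpha_clip_caption` are hypothetical wrappers standing in for the frozen pre-trained DETR, SAM, and Alpha-CLIP models, and the 0.7 detection threshold is taken from the experimental settings listed later in this post.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class MaskPrompt:
    """One mask prompt from MPGM: a binary object-shape mask plus its caption."""
    mask: np.ndarray   # M_os, shape (H, W), values in {0, 1}
    caption: str       # w_mask, short description of the masked region


def detr_detect(image: np.ndarray, threshold: float = 0.7) -> List[tuple]:
    """Hypothetical wrapper around a frozen DETR detector: boxes with score >= threshold."""
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h // 2)]  # dummy box for illustration


def sam_segment(image: np.ndarray, box: tuple) -> np.ndarray:
    """Hypothetical wrapper around a frozen SAM model, prompted with a box."""
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = 1  # dummy rectangular mask
    return mask


def alpha_clip_caption(image: np.ndarray, mask: np.ndarray) -> str:
    """Hypothetical wrapper: Alpha-CLIP receives the image plus the mask (alpha channel)
    and is used to produce a caption for the masked region."""
    return "a cup on a table"  # dummy caption


def mpgm(image: np.ndarray) -> List[MaskPrompt]:
    """Mask Prompt Generation Module: DETR boxes -> SAM masks (M_os) -> mask captions (w_mask)."""
    prompts = []
    for box in detr_detect(image, threshold=0.7):
        m_os = sam_segment(image, box)
        w_mask = alpha_clip_caption(image, m_os)
        prompts.append(MaskPrompt(mask=m_os, caption=w_mask))
    return prompts


if __name__ == "__main__":
    dummy = np.zeros((480, 640, 3), dtype=np.uint8)
    print(len(mpgm(dummy)), "mask prompt(s)")
```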

2) MPFEM (mask prompt feature enhancement module): removes the background + strengthens object-centric features (rough sketch below)
a. Feed the original image into a ViT to obtain the global feature,
b. use the object masks (M_os) from MPGM to carve out per-object instance features,
c. then concatenate them all and pass them through a CNN that reduces the channel dimension, producing the enhanced visual feature (F_v).
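
A minimal PyTorch-style sketch of MPFEM as I read it: the ViT global feature is masked per object with M_os, the global and instance features are concatenated along the channel axis, and a convolution reduces the channels back to give F_v. The tensor shapes and the 1×1 convolution are my assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


def mpfem(global_feat: torch.Tensor, masks: torch.Tensor, reduce: nn.Conv2d) -> torch.Tensor:
    """Mask Prompt Feature Enhancement Module (sketch).

    global_feat: (B, C, H, W) feature map from the ViT backbone.
    masks:       (B, N, H, W) binary object-shape masks M_os from MPGM.
    reduce:      a conv layer mapping (N + 1) * C channels back to C.
    Returns F_v: (B, C, H, W) enhanced visual feature.
    """
    b, c, h, w = global_feat.shape
    n = masks.shape[1]
    # Per-instance features: zero out the background with each object mask.
    inst = global_feat.unsqueeze(1) * masks.unsqueeze(2)   # (B, N, C, H, W)
    inst = inst.reshape(b, n * c, h, w)
    # Concatenate global + instance features, then reduce the channel dimension.
    fused = torch.cat([global_feat, inst], dim=1)          # (B, (N+1)*C, H, W)
    return reduce(fused)                                   # F_v


if __name__ == "__main__":
    B, N, C, H, W = 1, 3, 256, 32, 32
    conv = nn.Conv2d((N + 1) * C, C, kernel_size=1)        # assumed 1x1 reduction conv
    f_v = mpfem(torch.randn(B, C, H, W),
                torch.randint(0, 2, (B, N, H, W)).float(), conv)
    print(f_v.shape)  # torch.Size([1, 256, 32, 32])
```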

3) APM (affordance prediction module): fuses with the text prompts and outputs the final affordance segmentation map
(Figure: APM details)

a. First, produce the masks matched to the classes (mask proposals): tokenize the object name (w_obj), the affordance name (w_aff), and the mask caption w_mask from step 1-b with the CLIP tokenizer,
- concatenate the tokens and embed them with CLIP to obtain F_t,
- then feed the visual feature F_v from step 2-c and the text embedding F_t into the pixel decoder.
- Inside the pixel decoder, F_v passes through a self-attention block and L2 normalization, then enters a cross-attention block together with F_t,
- and then an FFN block; repeating this L times yields the fused feature F_vt.
b. Second, predict the affordance class for each mask (mask class embedding).
- Finally, F_vt is passed through an MLP to produce the class-matched masks (M_ca) and the mask class embeddings (F_cls).
- F_cls is then dot-producted with F_t to obtain scores (s_cls) over the open set of affordance classes (sketch below).
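
Below is a hedged sketch of one pixel-decoder layer plus the mask/class heads described above: self-attention and L2 normalization on the visual tokens, cross-attention with the text embedding F_t, an FFN (stacked L times in the real model), and then linear heads producing toy mask proposals (M_ca) and class embeddings (F_cls), whose dot product with F_t gives the open-vocabulary scores s_cls. All layer sizes are toy values; residual connections and other details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelDecoderLayer(nn.Module):
    """One pixel-decoder layer (sketch): self-attn + L2 norm -> cross-attn with F_t -> FFN."""

    def __init__(self, d: int = 512, heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))

    def forward(self, f_v: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        x, _ = self.self_attn(f_v, f_v, f_v)
        x = F.normalize(x, dim=-1)            # L2 normalization
        x, _ = self.cross_attn(x, f_t, f_t)   # queries: visual tokens, keys/values: text
        return self.ffn(x)                    # stacked L times in the full decoder


if __name__ == "__main__":
    d, n_vis, n_txt = 512, 64, 10
    f_v = torch.randn(1, n_vis, d)            # flattened visual tokens (from F_v)
    f_t = torch.randn(1, n_txt, d)            # CLIP text embeddings (w_obj / w_aff / w_mask)
    f_vt = PixelDecoderLayer(d)(f_v, f_t)     # fused feature F_vt

    # Toy heads mapping F_vt to mask proposals (M_ca) and class embeddings (F_cls).
    mask_head, cls_head = nn.Linear(d, 32 * 32), nn.Linear(d, d)
    m_ca = mask_head(f_vt).view(1, n_vis, 32, 32)   # per-query mask logits
    f_cls = cls_head(f_vt)                          # (1, n_vis, d)
    s_cls = f_cls @ f_t.transpose(1, 2)             # dot product with text -> (1, n_vis, n_txt)
    print(m_ca.shape, s_cls.shape)
```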

  • The loss combines classification accuracy and mask accuracy: L = L_cls(ŝ_cls, s_cls) + λ · L_mask(m̂, m)
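
As a toy rendering of that objective, assuming cross-entropy for the classification term and binary cross-entropy for the mask term (the post does not spell out the exact form of L_mask):

```python
import torch
import torch.nn.functional as F


def maskprompt_loss(s_cls_pred, s_cls_gt, m_pred, m_gt, lam: float = 1.0):
    """L = L_cls(ŝ_cls, s_cls) + λ · L_mask(m̂, m)  (sketch; the concrete loss choices are assumptions)."""
    l_cls = F.cross_entropy(s_cls_pred, s_cls_gt)               # classification term
    l_mask = F.binary_cross_entropy_with_logits(m_pred, m_gt)   # mask term (assumed BCE)
    return l_cls + lam * l_mask


if __name__ == "__main__":
    s_pred, s_gt = torch.randn(4, 25), torch.randint(0, 25, (4,))   # 25 affordance classes
    m_pred, m_gt = torch.randn(4, 32, 32), torch.randint(0, 2, (4, 32, 32)).float()
    print(maskprompt_loss(s_pred, s_gt, m_pred, m_gt, lam=1.0))
```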

🧪 Experiments

Datasets

  1. OVAS-25 (the dataset proposed in this work)
    • Composition: IIT-AFF + Pascal-Part-108, re-annotated (relabeled with the affordances of objects, people, and animals)
    • Classes: 28 entity classes, 25 affordance classes
    • Scale: 18,938 images in total (8,835 from IIT-AFF + 10,103 from Pascal)
      • Train: 11,363 images
      • Test: 7,575 images

  2. IIT-AFF (Nguyen et al. 2017)
    • Classes: 10 object categories, 9 affordance categories
    • Scale: 8,835 images in total
      • 6,496 images from ImageNet
      • 2,339 frames of cluttered scenes captured with a robot camera

  3. Pascal-Part-108 (Michieli et al. 2020)
    • Classes: 20 object categories, 108 object-part categories
    • Scale: 10,103 images in total
    • In this work, the annotations are converted into affordance labels and used to build OVAS-25

  4. UMD (Myers et al. 2015) & other part datasets
    • UMD affordance dataset
    • Additional evaluation datasets:
      • Pascal-Part-58 (Chen et al. 2014)
      • Pascal-Part-116 (Wei et al. 2024)
      • Pascal-Part-201 (Singh et al. 2022)
      • ADE20K-Part-234 (Wei et al. 2024)

Experimental Setup and Metrics

  • Object detector: pre-trained DETR
    • Threshold: T = 0.7
    • DETR, SAM, Alpha-CLIP → all frozen
  • Training settings (a config sketch follows after this list)
    • Iterations: 120K
    • Learning rate: 1e-4
      • decayed by 10× at 60K and 100K iterations
    • Optimizer: AdamW
    • Weight decay: 1e-4
    • Batch size: 32
  • Pixel decoder
    • Number of layers (L): 6
    • Embedding dimension: 768
    • Number of multi-head attention heads: 12
    • Hidden dimension (FFN): 3072
    • Feature dimensions: d = d_t = d_v = d_vt = d_cls = 512
  • Hardware
    • NVIDIA A800 80GB GPU
  • Evaluation metrics
    • mIoU (mean Intersection over Union)
    • mAvg (mean Average)
    • F1-Score
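
For reference, the optimization settings above map onto a standard PyTorch recipe roughly like this; the `nn.Linear` model is only a stand-in for the trainable parts of MaskPrompt, and implementing the “10× decay at 60K/100K” with MultiStepLR is my choice, not necessarily the authors'.

```python
import torch
import torch.nn as nn

# Stand-in module for the trainable parts of MaskPrompt (DETR / SAM / Alpha-CLIP stay frozen).
model = nn.Linear(512, 25)

# AdamW, lr 1e-4, weight decay 1e-4; batch size 32 and 120K iterations per the settings above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# "Decayed by 10x at 60K and 100K iterations" -> MultiStepLR with gamma=0.1.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60_000, 100_000], gamma=0.1
)

print(scheduler.get_last_lr())  # [0.0001]
```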

Results and Analysis

A. Quantitative Results
(Figure: quantitative result tables)

1. 🎯 OVAS-25 (the benchmark proposed in this paper)
  • MaskPrompt (ResNet-101): mIoU 71.26, F1 81.58 → +5.27% over the previous SOTA
2. 🎯 Existing datasets (IIT-AFF, UMD)
  • IIT-AFF: F1 89.46
  • UMD: F1 93.83 (competitive with the previous best models)
3. 🎯 Extension to part segmentation
  • Strong generalization also demonstrated on Pascal-Part-58, -108, -201, and ADE20K-Part-234

B. 👀 Qualitative Comparison

(Figure: qualitative comparison examples)

a. Cluttered backgrounds: suppresses interference better than previous models
b. Small object parts: e.g., correctly detects even the “contain” affordance of a bottle cap
c. Adjacent objects: separates them precisely even when boundaries overlap


C. 🧪 Ablation Study

  • Row 2: adding MPFEM → +6.9% mIoU
  • Row 3: adding MPGM and modifying the pixel decoder to accept text (adding cross-attention) → a further +2.24% mIoU
  • Row 4: adding the pixel decoder → best performance

D. It also uses less computing power!

✅ Conclusion

  • MaskPrompt is a new approach to open-vocabulary affordance segmentation
  • Main contributions:
    1. First to propose the OVAS task and the OVAS-25 dataset
    2. Development of the object-mask-based MaskPrompt framework
    3. SOTA-level performance across diverse datasets
  • → It can make a meaningful contribution to real-world applications such as robotics, HOI, and AR/VR 🎯

This post is licensed under CC BY 4.0 by the author.