
๐Ÿ” WSMA: Multimodal Weak Supervision์œผ๋กœ Egocentric Affordance Grounding ํ˜์‹ !

๐Ÿ” WSMA: Multimodal Weak Supervision์œผ๋กœ Egocentric Affordance Grounding ํ˜์‹ !

๐Ÿ” (ํ•œ๊ตญ์–ด) WSMA: Multimodal ์•ฝ์ง€๋„ ํ•™์Šต์œผ๋กœ Affordance Grounding ๊ณ ๋„ํ™”!

Image

  • Title: Weakly Supervised Multimodal Affordance Grounding for Egocentric Images
  • Venue: AAAI 2024
  • Code/checkpoints: GitHub – WSMA
  • Authors: Lingjing Xu, Yang Gao, Wenfeng Song, Aimin Hao (Beihang Univ., BISTU)
  • Key keywords: Affordance, Weakly-Supervised, Multimodal, CLIP, Egocentric, Robotics
  • Summary: WSMA is a new multimodal framework that extracts affordance knowledge from exocentric images + text descriptions and transfers it to egocentric images. It localizes affordance regions accurately without any pixel-level annotations and outperforms the previous SOTA! 🚀

🚀 Research Highlights

One-line summary: “WSMA = exocentric + text → transfer to egocentric → accurate affordance localization with only weak supervision!”

1) Background: Affordance Grounding

  • ๊ฐ์ฒด๊ฐ€ ์ œ๊ณตํ•˜๋Š” ํ–‰๋™ ๊ฐ€๋Šฅ์„ฑ(action possibilities) โ†’ ์ปต์€ โ€œ๋งˆ์‹œ๊ธฐโ€, ์นผ๋‚ ์€ โ€œ์ž๋ฅด๊ธฐโ€
  • ๋ฌธ์ œ: ๊ธฐ์กด ์—ฐ๊ตฌ๋Š” Pixel ๋‹จ์œ„ ์–ด๋…ธํ…Œ์ด์…˜ ์˜์กด โ†’ ๋น„์šฉโ†‘, ์˜ค๋ฅ˜โ†‘
  • ํ˜„์‹ค์  ํ•™์Šต: ์ด๋ฏธ์ง€ ์ˆ˜์ค€ ๋ผ๋ฒจ(image-level labels) ๋งŒ์œผ๋กœ affordance ์˜์—ญ ํ•™์Šต ํ•„์š”(Weakly supervised )

2) The WSMA Method

  • HOI-Transfer Module: extracts affordance knowledge from exocentric images (human-object interactions) → transfers it to the egocentric image
  • Pixel-Text Fusion Module: fuses affordance text features (from the CLIP text encoder) with egocentric image features
  • Weak supervision: CAM-based inference plus a refinement module → fine-grained affordance regions

3) Final Output

  • Produces an affordance heatmap on the egocentric image
  • Accurately localizes functional regions such as “grasp, drink, cut” without any pixel-level annotations (a minimal CAM-style sketch follows below)
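
To make the heatmap step concrete, here is a minimal CAM-style sketch in PyTorch. The tensor shapes, the classifier weights, and the min-max normalization are illustrative assumptions; the paper's refined CAM module is not reproduced here.

```python
# Minimal CAM-style heatmap sketch (shapes and the classifier head are assumed;
# the paper's refinement module is not reproduced, only the basic CAM idea).
import torch
import torch.nn.functional as F

def cam_heatmap(feat, fc_weight, class_idx, out_size=(224, 224)):
    """feat: (C, H, W) conv feature map, fc_weight: (num_classes, C)."""
    w = fc_weight[class_idx]                                   # weights of the target class
    cam = torch.einsum("c,chw->hw", w, feat)                   # class activation map
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return cam                                                 # affordance heatmap

# usage with dummy tensors (768-dim features, 36 affordance classes assumed)
heatmap = cam_heatmap(torch.randn(768, 14, 14), torch.randn(36, 768), class_idx=3)
```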

๐Ÿ” ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ํ•œ๊ณ„์™€ ์ฐจ๋ณ„์ 

  • Visual affordance grounding
    • The concept of affordance was defined by Gibson and has since been carried into visual affordance grounding.
    • However, reliance on pixel-level ground truth is expensive and error-prone, so several weakly supervised approaches have been explored.
    • For example, learning from a few keypoints, or learning from videos!
    • Most recently, weakly supervised approaches using image-level labels have appeared.
    • This work goes a step further and additionally exploits textual information about the action!
  • Cross-view knowledge distillation
    • Knowledge distillation is the deep-learning technique of training a student model under a teacher model.
    • Cross-view knowledge distillation, by contrast, focuses on transferring knowledge across different viewpoints.
    • Prior work has transferred knowledge from exocentric images to egocentric images.
  • Vision-language models
    • CLIP needs no further introduction!
    • A wide range of work builds on CLIP, e.g. for segmentation.
    • This work uses CLIP to extract the textual features.

🧱 WSMA Architecture


  • 3๊ฐœ์˜ ์ฃผ์š” Branch๋กœ ๊ตฌ์„ฑ : Exocentric, Egocentric, and Text branches.

1) Egocentric Branch

  • The egocentric image (I_g) is passed through DINO-ViT to extract features!
  • Those features go through a 2-layer MLP.
  • The result is the egocentric image feature f_g (a minimal sketch follows below).
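
A minimal PyTorch sketch of this branch, assuming the ViT-S/16 DINO backbone from torch.hub and a 512-dim MLP (the paper's exact variant and widths may differ):

```python
# Egocentric branch sketch: frozen DINO-ViT patch tokens -> 2-layer MLP -> f_g.
# Backbone variant and feature dims are assumptions.
import torch
import torch.nn as nn

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")  # assumed backbone
dino.eval()

mlp = nn.Sequential(nn.Linear(384, 512), nn.ReLU(), nn.Linear(512, 512))

@torch.no_grad()
def dino_patch_tokens(img):                                # img: (B, 3, 224, 224)
    tokens = dino.get_intermediate_layers(img, n=1)[0]     # (B, 1 + N, 384)
    return tokens[:, 1:]                                   # drop CLS token -> (B, N, 384)

img_ego = torch.randn(1, 3, 224, 224)
f_g = mlp(dino_patch_tokens(img_ego))                      # (B, N, 512) egocentric feature
```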

2) Text Branch

  • For each affordance label (C), an affordance text (T) is built CoOp-style from trainable prompt vectors (V).
  • This is fed into the CLIP text encoder to obtain the text feature f_t (a minimal CoOp-style sketch follows below).
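
A minimal CoOp-style sketch using the openai clip package: learnable context vectors are prepended to each affordance class name and pushed through the frozen CLIP text encoder. The prompt length, CLIP variant, and class names are assumptions.

```python
# CoOp-style text branch sketch: trainable prompts V + class name -> CLIP text encoder -> f_t.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git

clip_model, _ = clip.load("ViT-B/16", device="cpu")        # assumed CLIP variant

class CoOpTextEncoder(nn.Module):
    def __init__(self, classnames, n_ctx=16):
        super().__init__()
        dim = clip_model.ln_final.weight.shape[0]           # text transformer width (512)
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # trainable prompts V
        # tokenize "X X ... X <classname>" so that token positions line up
        prompts = [" ".join(["X"] * n_ctx) + " " + c for c in classnames]
        self.tokens = clip.tokenize(prompts)                # (C, 77)
        with torch.no_grad():
            emb = clip_model.token_embedding(self.tokens)   # (C, 77, dim)
        self.register_buffer("prefix", emb[:, :1])          # SOT token embedding
        self.register_buffer("suffix", emb[:, 1 + n_ctx:])  # class name + EOT + padding

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(self.prefix.size(0), -1, -1)
        x = torch.cat([self.prefix, ctx, self.suffix], dim=1)       # (C, 77, dim)
        x = x + clip_model.positional_embedding
        x = clip_model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
        x = clip_model.ln_final(x)
        eot = self.tokens.argmax(dim=-1)                     # EOT index per class
        return x[torch.arange(x.size(0)), eot] @ clip_model.text_projection  # f_t: (C, 512)

f_t = CoOpTextEncoder(["hold", "drink", "cut"])()            # assumed affordance labels
```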

3) Pixel-Text Fusion Module

  • Ego ์ด๋ฏธ์ง€์™€ Text ์ •๋ณด๋ฅผ ์ž˜ ํ•ฉ์น˜๋Š” ๋ถ€๋ถ„!
  • ์ด๋ฏธ์ง€(f_g), Text(f_t)์˜ align ํ•ด์•ผํ•œ๋‹ค!! a. (Alignment1) ์ด๋ฏธ์ง€์ •๋ณด(f`_g) ๋ž‘ ํ…์ŠคํŠธ์ •๋ณด(f_t)์˜ Global align ์ •๋„๋ฅผ ํ‰๊ฐ€ > L_clip
    • f_g๋Š” DINO๋กœ์„œ local ์ •๋ณด๋งŒ ์žˆ๋‹ค. ๊ทธ๋ž˜์„œ Global์ •๋ณด๋ฅผ ์ถ”๊ฐ€ํ•ด์ค€๋‹ค!
    • fโ€ฒ_g = AttentionPool(Concat(Average(fg), fg)).
    • Average(fg) : ๊ธ€๋กœ๋ฒŒ ์ •๋ณด
    • AttentionPool๋กœ ๊ฐ€๊ณต : local ํŒจ์น˜์™€ global ์‚ฌ์ด ๊ด€๊ณ„(์œ ์‚ฌ๋„) ๊ณ„์‚ฐ, ์ค‘์š”ํ•œ ํŒจ์น˜์— ๋†’์€ weight๋ฅผ ์ฃผ์–ด ๊ฐ€์ค‘ํ•ฉ๋œ ์ƒˆ๋กœ์šด ๋ฒกํ„ฐ f`_g ์‚ฐ์ถœ!!
    • Z_clip = f_โ€ฒg(0) ยท f_t(transpose), โ€ฒ
    • ์ด๋ฏธ์ง€์ •๋ณด(f`_g) ๋ž‘ ํ…์ŠคํŠธ์ •๋ณด(f_t)๊ฐ€ ์–ผ๋งˆ๋‚˜ ์œ ์‚ฌํ•œ์ง€๋ฅผ ์‚ฐ์ถœ(Z_clip)
    • ๊ฒฐ๊ตญ Z_clip์€ ์ด๋ฏธ์ง€์ •๋ณด(f`_g) ๋ž‘ ํ…์ŠคํŠธ์ •๋ณด(f_t)์˜ align ์ •๋„๋ฅผ ํ‰๊ฐ€ํ•˜๋ฉฐ ์ดํ›„ cross-entropy loss ์šฉ L_clip ์œผ๋กœ ํ™œ์šฉ๋Œ
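
A minimal sketch of this global alignment step, assuming 512-dim features, a single multi-head attention layer as the attention pool, and image-level affordance labels (all assumptions, not the paper's exact layers):

```python
# Global image-text alignment sketch: f'_g = AttentionPool(Concat(Average(f_g), f_g)),
# Z_clip = f'_g(0) · f_t^T, trained with a cross-entropy loss L_clip.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAlign(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.pool = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, f_g, f_t):
        # f_g: (B, N, dim) egocentric patch features, f_t: (C, dim) text features
        avg = f_g.mean(dim=1, keepdim=True)            # Average(f_g): global token
        seq = torch.cat([avg, f_g], dim=1)             # Concat(Average(f_g), f_g)
        f_g_prime, _ = self.pool(seq, seq, seq)        # attention pooling over the sequence
        z_clip = f_g_prime[:, 0] @ f_t.t()             # Z_clip: (B, C) image-text similarity
        return f_g_prime, z_clip

f_g, f_t = torch.randn(2, 196, 512), torch.randn(36, 512)
f_g_prime, z_clip = GlobalAlign()(f_g, f_t)
labels = torch.tensor([3, 7])                          # image-level affordance labels
L_clip = F.cross_entropy(z_clip, labels)               # global alignment loss
```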

b. (Alignment 2) Measure the fine-grained (local) alignment between the text and each image patch → L_cls
    • f_att = f_t · [f'_g(1:)]^T
    • f_t: the text feature
    • f'_g(1:): index 0 is the global token, so it is dropped and the patches from index 1 onward are used
    • So f_att is the similarity between the text feature and each local patch!
    • Next, an image feature map (F_g) that reflects the text semantics is built:
    • F_g = f_g × f_att + f_g
    • f_g × f_att: highlights the locations related to the text semantics
    • + f_g: preserves the original image features
    • F_g passes through a 3 × 3 convolutional layer and an FC layer to produce the class score c_ego!
    • c_ego is then used in the cross-entropy loss L_cls (a sketch follows below).
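
A minimal sketch of this local alignment step. How the class-to-patch attention f_att is collapsed into a single patch-weight map is not spelled out above, so the ground-truth class row followed by a softmax is used here as an assumption, as are the feature dimensions:

```python
# Local text-patch alignment sketch: f_att = f_t · f'_g(1:)^T, F_g = f_g x f_att + f_g,
# then a 3x3 conv + FC gives the class score c_ego for the loss L_cls.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAlign(nn.Module):
    def __init__(self, dim=512, n_classes=36):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)   # 3x3 conv on F_g
        self.fc = nn.Linear(dim, n_classes)                         # -> class score c_ego

    def forward(self, f_g, f_g_prime, f_t, labels):
        # f_g: (B, N, d), f_g_prime: (B, N+1, d), f_t: (C, d)
        patches = f_g_prime[:, 1:]                         # drop the global token at index 0
        f_att = torch.einsum("cd,bnd->bcn", f_t, patches)  # text-to-patch similarity
        w = f_att[torch.arange(f_g.size(0)), labels]       # (B, N) weights of the GT class
        w = w.softmax(dim=-1).unsqueeze(-1)                # assumed normalization
        F_g = f_g * w + f_g                                # text-aware map + original features
        B, N, d = F_g.shape
        h = int(N ** 0.5)
        F_map = F_g.transpose(1, 2).reshape(B, d, h, h)    # back to a 2D feature map
        c_ego = self.fc(self.conv(F_map).mean(dim=(2, 3))) # 3x3 conv + pooled FC
        return c_ego, F.cross_entropy(c_ego, labels)       # class score and L_cls

f_g, f_t = torch.randn(2, 196, 512), torch.randn(36, 512)
f_g_prime = torch.randn(2, 197, 512)
c_ego, L_cls = LocalAlign()(f_g, f_g_prime, f_t, torch.tensor([3, 7]))
```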

4) Exocentric Branch

  • From the n exocentric images (i = 1..n), features are extracted with the same DINO-ViT backbone.
  • The features of the last two DINO layers (f_b-1_i, f_b_i) are concatenated and passed through an MLP to obtain the feature f_i_x for image i.
  • The AIM module is then applied to obtain F_i_x, the action feature extracted from the i-th exocentric image!
  • F_i_x passes through a 3 × 3 convolutional layer and an FC layer to produce the class score c_exo!
  • c_exo is used in the cross-entropy loss L_cls (a sketch of the data flow follows below).
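
A minimal sketch of the described data flow (last-two-layer concatenation, MLP, then 3 × 3 conv + FC). The paper's AIM module is not reproduced here, and the backbone variant, dimensions, and number of affordance classes are assumptions:

```python
# Exocentric branch sketch: concat the last two DINO-ViT layers -> MLP -> f_i_x,
# then 3x3 conv + FC -> c_exo (the AIM module is omitted in this sketch).
import torch
import torch.nn as nn

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")  # assumed backbone
dino.eval()

mlp = nn.Sequential(nn.Linear(2 * 384, 512), nn.ReLU(), nn.Linear(512, 512))
conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)
fc = nn.Linear(512, 36)                                    # 36 affordance classes (assumed)

@torch.no_grad()
def last_two_layer_tokens(img):                            # img: (B, 3, 224, 224)
    f_prev, f_last = dino.get_intermediate_layers(img, n=2)     # last two blocks
    return torch.cat([f_prev[:, 1:], f_last[:, 1:]], dim=-1)    # drop CLS, concat features

img_exo = torch.randn(1, 3, 224, 224)
f_i_x = mlp(last_two_layer_tokens(img_exo))                # per-image exocentric feature
B, N, d = f_i_x.shape
F_map = f_i_x.transpose(1, 2).reshape(B, d, 14, 14)        # 14 x 14 patches for ViT-S/16 @ 224
c_exo = fc(conv(F_map).mean(dim=(2, 3)))                   # class score used in L_cls
```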

5) HOI-Transfer Module

  • As summarized earlier, this module transfers the human-object-interaction (affordance) knowledge captured by the exocentric branch into the egocentric branch; it is trained with the distillation loss (L_d) and relation loss (L_rela) that appear in the ablation below.

🧪 Experiments

๋ฐ์ดํ„ฐ์…‹ & ์ง€ํ‘œ

  • AGD20K (seen/unseen splits)
  • HICO-IIF
  • Metrics: KLD ↓, SIM ↑, NSS ↑ (reference implementations sketched below)
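
For reference, here are minimal implementations of the three metrics, following the standard saliency-metric definitions; the epsilon value and the NSS binarization threshold are assumptions:

```python
# KLD (lower is better), SIM and NSS (higher is better) between a predicted
# affordance heatmap and a ground-truth heatmap.
import torch

EPS = 1e-12

def kld(pred, gt):
    p = pred / (pred.sum() + EPS)                          # normalize to distributions
    q = gt / (gt.sum() + EPS)
    return (q * torch.log(EPS + q / (p + EPS))).sum()      # KL(GT || Pred)

def sim(pred, gt):
    p = pred / (pred.sum() + EPS)
    q = gt / (gt.sum() + EPS)
    return torch.minimum(p, q).sum()                       # histogram intersection

def nss(pred, gt, thresh=0.5):
    p = (pred - pred.mean()) / (pred.std() + EPS)          # z-score the prediction
    fix = gt > thresh * gt.max()                           # assumed binarization of the GT map
    return p[fix].mean()

pred, gt = torch.rand(224, 224), torch.rand(224, 224)
print(kld(pred, gt).item(), sim(pred, gt).item(), nss(pred, gt).item())
```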

Results

  • AGD20K-unseen: KLD 1.335, SIM 0.382, NSS 1.220
  • AGD20K-seen: KLD 1.176, SIM 0.416, NSS 1.247
  • HICO-IIF: KLD 1.465, SIM 0.358, NSS 1.012
  • → Outperforms previous models such as LOCATE and Cross-view-AG

Qualitative Comparison

  • ์ปต์˜ ์ž…๊ตฌ, ์นซ์†”์˜ ๋ ๋“ฑ ์ž‘์€ affordance ๋ถ€์œ„๋ฅผ ์ •ํ™•ํžˆ localize
  • ๋ฐฐ๊ฒฝ ๊ฐ„์„ญ ์–ต์ œ ๋ฐ unseen ๊ฐ์ฒด์—์„œ๋„ ์ผ๋ฐ˜ํ™” ์ž˜ ์ˆ˜ํ–‰

🧪 Ablation Study

  • Module removal experiments:
    • Ego branch only → performance drops
    • Adding HOI-Transfer → performance improves
    • Adding Pixel-Text Fusion → performance improves further
    • Including both modules → best performance
  • Loss analysis:
    • The combination of cross-entropy (L_cls) + CLIP alignment (L_clip) + distillation loss (L_d) + relation loss (L_rela) is the most effective (a sketch of the combined objective follows below)
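
A minimal sketch of how the four losses could be combined into one training objective; the weighting coefficients are assumptions, since the post does not list the paper's exact values:

```python
# Combined objective sketch: classification + CLIP alignment + distillation + relation losses.
import torch

def total_loss(L_cls, L_clip, L_d, L_rela, w_clip=1.0, w_d=1.0, w_rela=1.0):
    # the weights w_* are placeholders, not the paper's values
    return L_cls + w_clip * L_clip + w_d * L_d + w_rela * L_rela

loss = total_loss(torch.tensor(0.9), torch.tensor(0.4), torch.tensor(0.2), torch.tensor(0.1))
```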

✅ Conclusion

  • WSMA is a multimodal weakly supervised framework that substantially improves egocentric affordance grounding
  • Main contributions:
    1. The HOI-Transfer Module transfers exocentric affordance knowledge
    2. The Pixel-Text Fusion Module fuses text and image features
    3. SOTA-level performance on AGD20K and HICO-IIF
  • → An important step toward real-world applications such as robot perception, human-robot interaction, and AR/VR 🎯
