
🔍 WSAG-PLSP: Solving Affordance Grounding with Weakly Supervised Learning!

🔍 WSAG-PLSP: Solving Affordance Grounding with Part-Level Semantic Propagation!

Image

  • Title: WSAG-PLSP: Part-Level Semantic Propagation for Weakly Supervised Affordance Grounding
  • Venue: ICLR 2025
  • Code/checkpoints: GitHub – WSAG-PLSP
  • Authors: Peiran Xu, Yadong Mu (Peking University)
  • Keywords: Affordance, Weakly-Supervised, Part-Level, Semantic Propagation, Vision-Language
  • Summary: WSAG-PLSP tackles Weakly Supervised Affordance Grounding (WSAG), where affordances are learned from image-level labels only. It proposes a new framework that localizes affordance regions more precisely through part-level semantic propagation (PLSP), and reaches SOTA performance on datasets such as AGD20K, UMD, and IIT-AFF 🚀

🚀 Key Research Summary

One-line summary: “WSAG-PLSP = build pseudo labels, refine them once more, and train a Transformer in a supervised way together with exo images!”

Image

1) Background of the task (WSAG)

  • Weakly Supervised Affordance Grounding: learn to localize affordance regions without pixel-level labels
  • Existing CAM-based approaches only highlight a salient part, so they fail to capture the semantic relations between the fine-grained parts where an affordance actually arises

2) WSAG-PLSP method
① Build pseudo labels: turn the hand-written object part prompt (p) into a label through VLpart + SAM
② Refine labels: inspired by the occlusion that occurs in exo images; some labels are produced by a pretrained model
③ Supervised baseline: learn the labels with a cross modal fuser (transformer) architecture
④ Use exo images: align against the single most similar exo image

3) Final outputs

  • affordance heatmap (pixel level)
  • per-object affordance presence score

🔍 Related work so far!

  1. The concept of affordance: proposed by Gibson in 1977 from a psychology perspective!!
    • Recently it has been studied a lot as AI applied to robots!
  2. Fully supervised: early work relied on full supervision!!
    • High performance, but labeling cost and annotator subjectivity limit class diversity (Myers et al., 2015; Nguyen et al., 2017).
  3. Weakly supervised approaches appeared: predict with only image-level labels (+ exo-centric images) provided!
    • Existing methods (CROSS-VIEW-AG, LOCATE, WSMA) mostly classify affordances on top of CAM, but CAM highlights only the most salient part, which makes it hard to cover the whole affordance region.
    • When using exocentric images, existing methods also rely on global pooling (CROSS-VIEW-AG) or masked pooling (LOCATE), which can let noise in
  4. Progress in Visual Foundation Models (VFMs) and Multi-modal LLMs (MLLMs)
    • There has been a lot of work such as SAM, CLIP, and VLpart, and they perform well!
    • With these models, high-quality dense annotation becomes possible in a zero-shot way

(Note) References for each method

  • CROSS-VIEW-AG : Luo, Hongchen, et al. “Learning affordance grounding from exocentric images.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  • LOCATE : Li, Gen, et al. “Locate: Localize and transfer object parts for weakly supervised affordance grounding.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • WSMA : Xu, Lingjing, et al. “Weakly supervised multimodal affordance grounding for egocentric images.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 6. 2024.

🔍 The method of this paper!!!

3.2 Model Architecture

Image

  • Enc_V : image encoder. Encodes the image I into the feature map F_V
  • Enc_T : text encoder. Converts the affordance query (a) into the vector f_T (CLIP-based)
  • Cross Modal Fuser: builds a fused affordance grounding vector f_A from f_T and F_V (Transformer-block cross-attention with query = f_T, key/value = F_V)
  • Dec : decodes F_V and f_A into the final heatmap H_pred (SAM-based)
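
To make the fuser step concrete, here is a minimal single-query cross-attention sketch in plain Python. It is only an illustration under assumptions: there are no learned projection matrices, heads, or layer norms, and the toy values of `F_V` and `f_T` are made up. The actual fuser in the paper is a stack of Transformer blocks.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(f_T, F_V):
    """Single-query cross-attention: query = f_T, keys/values = rows of F_V.

    Learned projections are omitted (an assumption made for brevity).
    """
    d = len(f_T)
    scores = [sum(q * k for q, k in zip(f_T, token)) / math.sqrt(d) for token in F_V]
    weights = softmax(scores)
    # Weighted sum of the visual tokens gives the fused vector f_A.
    f_A = [sum(w * token[i] for w, token in zip(weights, F_V)) for i in range(d)]
    return f_A, weights

# Toy example: 3 visual tokens of dimension 2 (hypothetical values).
F_V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_T = [2.0, 0.0]
f_A, weights = cross_attend(f_T, F_V)
```

The attention weights show which visual tokens the text query attends to; tokens that line up with `f_T` contribute more to `f_A`.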

3.3 Building the PSEUDO LABELS (H_pl)

  • The key point of this work is to use a VLM to build pseudo labels (H_pl) for the ego images and then train in a supervised way!!
  • It is done in 2 steps

Image

Step1) Build the part name (p)!

  • The affordances here are all actions (verbs), which differs from the noun prediction that existing VLMs are good at.
  • Meanwhile, the affordance of a given object (o) lives on a part of that object!
  • So! A mapping P(o, a) returns the noun phrase for the relevant object part (ex. P(knife, hold) = handle of the knife)
  • The function P could be implemented with an LLM, but in this work it was written by hand!!
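
A hand-written P(o, a) can be sketched as a plain lookup table. Only the (knife, hold) entry comes from the post; the second entry and the fallback to the whole object name are hypothetical additions for illustration.

```python
# Hand-written mapping P(o, a) -> part name.
PART_NAMES = {
    ("knife", "hold"): "handle of the knife",  # example given in the post
    ("knife", "cut"): "blade of the knife",    # hypothetical extra entry
}

def part_name(obj, affordance):
    """Return the part prompt p for an (object, affordance) pair.

    Falling back to the object name when no part is defined is an
    assumption of this sketch, not something the post specifies.
    """
    return PART_NAMES.get((obj, affordance), obj)

prompt = part_name("knife", "hold")
```

The resulting string is what would be passed to the part detector as its text prompt.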

Step2) Build the training target H_pl

  • Use the VLpart model to produce a bbox from p,
  • then run SAM segmentation inside the bbox!! This gives the mask M_ego_part!
  • Convert M_ego_part into a heatmap-style H_pl and compare it against our model's output H_pred!

However!) VLpart sometimes fails at its job, and sometimes a clear p cannot be written at all!!

  • As a result, the accuracy of H_pl drops!!
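
One simple way to turn a binary mask into a heatmap-style target is to normalize it so that it sums to 1. The sketch below assumes exactly that; the paper may additionally smooth the mask, and the uniform fallback for an empty mask is an assumption of this example.

```python
def mask_to_heatmap(mask):
    """Normalize a binary H x W mask (nested lists) into a map that sums to 1."""
    total = sum(sum(row) for row in mask)
    if total == 0:
        # Assumption: fall back to a uniform map when the detector found nothing.
        n = len(mask) * len(mask[0])
        return [[1.0 / n for _ in row] for row in mask]
    return [[v / total for v in row] for row in mask]

# Toy 2 x 2 mask standing in for M_ego_part (hypothetical values).
M_ego_part = [[0, 1], [0, 1]]
H_pl = mask_to_heatmap(M_ego_part)
```

With both H_pl and H_pred normalized this way, they can be compared as probability distributions over pixels.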

3.4 Using EXOCENTRIC images!

Image

  • So far only the ego image has been used!
  • Exo-centric image (I_exo) > Enc_V > F_exo_V
  • I_ego > Enc_V > F_ego_V
  • If F_ego_V and F_exo_V are simply passed through GAP to extract object/action features, background and other noise leak in!
  • So VLpart and SAM come in again!
    • I_exo > VLpart with the object name > SAM > M_exo_obj
    • Masked average pooling of F_exo_V with M_exo_obj gives f_E (the object vector in the image),
    • f_A, the output of the Cross Modal Fuser, is then aligned with f_E!! (L_align)
    • Finally, the output of the MLP Head_exo(f_E) is matched to the one-hot encoded a^ with a cross-entropy loss, so Head_exo, which predicts the action, is trained as well! (L_exo_cls)
    • The visual encoder is fine-tuned at the same time, so f_E reflects an understanding of the affordance
  • Also, previous work used several exo images, but the authors doubted that this helps, so this work uses only 1 exo image.
    • And! Since it is better when the same object appears, they pick the exo image whose object is most similar to the object in the ego image!
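
The exo branch above can be sketched in a few helper functions: masked average pooling to get f_E, an alignment loss between f_A and f_E, and picking the most similar exo image. The cosine-distance form of the alignment loss and the cosine-similarity selection rule are assumptions of this sketch; the post does not give the exact formulas, and all toy values are made up.

```python
import math

def masked_average_pool(F, M):
    """Average the C-dim features of F (H x W x C) over pixels where M (H x W) is 1."""
    C = len(F[0][0])
    total, count = [0.0] * C, 0
    for feat_row, mask_row in zip(F, M):
        for feat, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(C):
                    total[c] += feat[c]
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def align_loss(f_A, f_E):
    # Assumed form of L_align: cosine distance between the two vectors.
    return 1.0 - cosine(f_A, f_E)

def pick_exo(g_ego, g_exo_candidates):
    # Index of the exo candidate whose object embedding is closest to the ego one.
    return max(range(len(g_exo_candidates)),
               key=lambda i: cosine(g_ego, g_exo_candidates[i]))

# Toy 2 x 2 feature map with 2 channels; the right column is "background".
F_exo_V = [[[1.0, 0.0], [9.0, 9.0]], [[3.0, 0.0], [9.0, 9.0]]]
M_exo_obj = [[1, 0], [1, 0]]
f_E = masked_average_pool(F_exo_V, M_exo_obj)
loss = align_loss([1.0, 0.0], f_E)
best = pick_exo([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]])
```

Because the mask excludes the right column, the large background values never reach f_E, which is the point of replacing plain GAP with masked pooling.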

3.5 Refining the PSEUDO LABELS

Pretrain stage: train the cross modal fuser to produce M_ego_pred from the ego image. This M_ego_pred ends up extracting only the part of the object that is occluded in the exo image

  • Another property of exo images! The human body occludes part of the object!
  • Image (ego or exo) > make a bbox with VLpart using the object name and crop! > Enc_V` (DINO or CLIP) > extract G (ego or exo)!!
  • G_ego and G_exo are the image features of the object!!
  • Next, M_exo_obj can be built with the method from 3.4; combining it with G_exo extracts the features of the actually visible object region, which gives M~_exo_obj.
  • Meanwhile, replacing the softmax used to produce H_pred with a sigmoid gives M_ego_pred!
    • So!! Take 1 - M_ego_pred (which flips the prediction to its complement) and keep only the bbox region to get M~_ego_obj
  • Now, train the Cross modal Fuser with the L_pretrain loss so that M~_ego_obj and M~_exo_obj match!
    • This means that the ego object with the predicted region removed becomes the same as the object seen in the exo image! In other words, the predicted region is the occluded part!
  • In the end, I_ego > Cross modal Fuser > M~_ego_pred is what gets predicted!!
  • However, the resulting M~_ego_pred may not separate the object cleanly and can have jagged boundaries, so the final region is extracted by comparing it against SAM masks.
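
The mask operations in this section can be sketched as follows: a sigmoid turns logits into a mask, the complement gives the "everything except the prediction" region, and a rough predicted mask is cleaned up by choosing the best-overlapping candidate mask. Choosing by IoU is one plausible reading of "comparing against SAM" and is an assumption here; the candidate masks are hypothetical stand-ins for SAM outputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_mask(logits, threshold=0.5):
    # Per-pixel sigmoid followed by thresholding (the threshold is an assumption).
    return [[1 if sigmoid(v) > threshold else 0 for v in row] for row in logits]

def complement(mask):
    # 1 - M: the region that is NOT predicted.
    return [[1 - v for v in row] for row in mask]

def iou(a, b):
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    union = sum(1 for x, y in pairs if x or y)
    return inter / union if union else 0.0

def snap_to_candidates(pred, candidates):
    # Pick the candidate mask (e.g. from SAM) that best overlaps the rough prediction.
    return max(candidates, key=lambda m: iou(pred, m))

logits = [[3.0, -2.0], [2.0, -4.0]]
M_ego_pred = predict_mask(logits)
rest = complement(M_ego_pred)
candidates = [[[1, 0], [0, 0]], [[1, 0], [1, 0]], [[0, 1], [0, 1]]]
refined = snap_to_candidates(M_ego_pred, candidates)
```

Snapping to a clean candidate mask replaces the jagged boundary of the raw prediction with the sharper boundary of the segmentation model.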

3.6 How is Unseen handled!?

  • You know AGD20K also has an Unseen split, right? Unseen means object/action combinations that never appeared in Seen!
    • For example, the model has only ever seen hold bottle, and it is then asked about hold cup!!
  • So a reasoning module is added to capture the relation between object and action!
  • The procedure is as follows
    a. I_ego > Enc_V > take the CLS token of the Transformer output > c_V
    b. Feed c_V into MLP_noun to get f_pred_obj, i.e. predict the object
    c. Feed f_pred_obj together with the action vector f_T into MLP_part to get f_pred_part (in Seen, the part name was written by hand)
    d. Then define L_reason so that f_pred_part matches Enc_T(p), the encoding of the object part (p), and f_pred_obj matches Enc_T(o), the encoding of the object, and use it to train the 2 MLPs!!
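
    Steps a to d can be written compactly as below. The concatenation [ · ; · ] as the input to MLP_part and the distance d (for example, cosine distance) are assumptions of this sketch; the post only says that the two pairs of vectors are made to match.

```latex
\begin{aligned}
c_V &= \mathrm{CLS}\big(\mathrm{Enc}_V(I_{ego})\big) \\
f^{pred}_{obj} &= \mathrm{MLP}_{noun}(c_V) \\
f^{pred}_{part} &= \mathrm{MLP}_{part}\big([\,f^{pred}_{obj};\, f_T\,]\big) \\
L_{reason} &= d\big(f^{pred}_{obj},\, \mathrm{Enc}_T(o)\big) + d\big(f^{pred}_{part},\, \mathrm{Enc}_T(p)\big)
\end{aligned}
```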

    And the final loss!?

    L_all = L_KL + λ1 (L_align + L_exo_cls) + λ2 L_reason

    • L_KL: loss on the difference between the cross modal fuser's H_pred and H_pl
    • L_align : uses the exo image to align ego and exo (f_A with f_E)
    • L_exo_cls : exo-image-based cross-entropy loss for classifying the action with Head_exo(f_E)
    • L_reason : loss for learning action, object + part relations for Unseen
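
    As a rough illustration of how these terms combine, the sketch below computes a KL divergence between two flattened heatmaps and then the weighted sum. The KL direction (H_pl against H_pred), the epsilon, and the default weights are assumptions; the post does not state the values of λ1 and λ2.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two flattened heatmaps that each sum to 1."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def total_loss(l_kl, l_align, l_exo_cls, l_reason, lambda1=1.0, lambda2=1.0):
    # L_all = L_KL + lambda1 * (L_align + L_exo_cls) + lambda2 * L_reason
    return l_kl + lambda1 * (l_align + l_exo_cls) + lambda2 * l_reason

# Toy flattened heatmaps (hypothetical values).
H_pl = [0.5, 0.5, 0.0, 0.0]
H_pred = [0.4, 0.4, 0.1, 0.1]
l_kl = kl_divergence(H_pl, H_pred)
l_all = total_loss(l_kl, l_align=0.2, l_exo_cls=0.3, l_reason=0.1, lambda1=0.5, lambda2=0.25)
```

    The KL term is zero only when the predicted heatmap matches the pseudo label exactly, so it drives the main supervised signal while the λ-weighted terms act as auxiliary objectives.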

🧪 Experimental Results and Ablation

Ablation Test

Image

  • Even the base model already improves performance a lot!
  • Refinement brings further gains as the pseudo labels get better,
  • and the reasoning module for Unseen gives a clear improvement.

Image

The final performance was impressive too!!


✅ Conclusion

  • WSAG-PLSP introduces Part-Level + Semantic Propagation and substantially improves affordance localization even under weak supervision
  • Main contributions:
    1. Finer-grained affordance units through part-level representation learning
    2. Spreading affordance semantics with the Semantic Propagation Module
    3. SOTA-level performance demonstrated on a range of datasets
  • → Useful for robot perception, human-robot interaction, and AR/VR applications 🎯

This post is licensed under CC BY 4.0 by the author.