Post

๐Ÿ‘ฅ GroupHOI: Human-Object Interaction์„ Group์œผ๋กœ ํ•™์Šตํ•˜๊ธฐ (NeurIPS 2025)

๐Ÿ‘ฅ GroupHOI: Human-Object Interaction์„ Group์œผ๋กœ ํ•™์Šตํ•˜๊ธฐ (NeurIPS 2025)

๐Ÿ‘ฅ GroupHOI: Learning Human-Object Interaction as Groups ๋…ผ๋ฌธ ์ฝ๊ธฐ!

  • ์ œ๋ชฉ: Learning Human-Object Interaction as Groups
  • ํ”„๋กœ์ ํŠธ: https://github.com/JiajunHong1/GroupHOI
  • ์ €์ž: Jiajun Hong, Jianan Wei, Wenguan Wang
  • ์†Œ์†: Zhejiang University
  • ๋ฐœํ‘œ: NeurIPS 2025
    README citation ๊ธฐ์ค€: The Thirty-ninth Annual Conference on Neural Information Processing Systems
  • ํ•ต์‹ฌ ํ‚ค์›Œ๋“œ: Human-Object Interaction Detection, Group-based HOI Learning, DETR, CLIP, HICO-DET, V-COCO
  • ํ•œ ์ค„ ์š”์•ฝ: HOI Detection์„ ๋‹จ์ˆœํ•œ human-object pair ๋ถ„๋ฅ˜ ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ๋ผ, ์ด๋ฏธ์ง€ ์•ˆ์—์„œ ํ•จ๊ป˜ ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š” ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด๋“ค์˜ group ๊ตฌ์กฐ๋กœ ํ•™์Šตํ•˜๋Š” ๋ฐฉํ–ฅ์„ ์ œ์•ˆํ•œ๋‹ค!! ๐Ÿš€

๐Ÿ›๏ธ ์–ด๋””์—์„œ ๋ฐœํ‘œ๋œ ์—ฐ๊ตฌ์ธ๊ฐ€?

GroupHOI๋Š” GitHub README์˜ citation ๊ธฐ์ค€์œผ๋กœ NeurIPS 2025์— ๋ฐœํ‘œ๋œ ์—ฐ๊ตฌ๋‹ค.

NeurIPS๋Š” ๋จธ์‹ ๋Ÿฌ๋‹๊ณผ ์ธ๊ณต์ง€๋Šฅ ๋ถ„์•ผ์—์„œ ๊ฐ€์žฅ ์˜ํ–ฅ๋ ฅ์ด ํฐ ํ•™ํšŒ ์ค‘ ํ•˜๋‚˜๋‹ค.
์ด ๋…ผ๋ฌธ์€ Human-Object Interaction Detection, ์ค„์—ฌ์„œ HOI Detection ๋ฌธ์ œ๋ฅผ ๋‹ค๋ฃฌ๋‹ค.

GitHub ์ €์žฅ์†Œ๋ฅผ ๋ณด๋ฉด ๋‹ค์Œ ํŠน์ง•์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

  • HICO-DET, V-COCO ํ‰๊ฐ€ ์ง€์›
  • DETR ResNet-50 pretrained detector ์‚ฌ์šฉ
  • CLIP ์„ค์น˜ ๋ฐ ์‚ฌ์šฉ ํ•„์š”
  • with_clip_label, with_obj_clip_label ์˜ต์…˜ ์‚ฌ์šฉ
  • PPDM, DETR, QPIC, CDN, GEN-VLKT ์ฝ”๋“œ ๊ธฐ๋ฐ˜ ์ผ๋ถ€ ํ™œ์šฉ
  • HICO-DET์™€ V-COCO pretrained model ๋ฐ config ์ œ๊ณต

์ฆ‰, GroupHOI๋Š” ๊ธฐ์กด HOI detection ๊ณ„์—ด ์—ฐ๊ตฌ ํ๋ฆ„ ์œ„์—์„œ,
์‚ฌ๋žŒ-๊ฐ์ฒด ์ƒํ˜ธ์ž‘์šฉ์„ group์ด๋ผ๋Š” ๊ด€์ ์œผ๋กœ ์žฌํ•ด์„ํ•˜๋ ค๋Š” ์—ฐ๊ตฌ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.


๐Ÿš€ ์—ฐ๊ตฌ ํ•ต์‹ฌ ์š”์•ฝ

ํ•œ ์ค„ ์š”์•ฝ: โ€œ์‚ฌ๋žŒ-๊ฐ์ฒด ์ƒํ˜ธ์ž‘์šฉ์„ pair ๋‹จ์œ„๋กœ๋งŒ ๋ณด์ง€ ๋ง๊ณ , interaction์ด ๋ฐœ์ƒํ•˜๋Š” group ๋‹จ์œ„๋กœ ์ดํ•ดํ•˜์ž!โ€

Human-Object Interaction Detection, ์ฆ‰ HOI Detection์€ ์ด๋ฏธ์ง€ ์•ˆ์—์„œ ์‚ฌ๋žŒ์ด ์–ด๋–ค ๊ฐ์ฒด์™€ ์–ด๋–ค ํ–‰๋™์„ ํ•˜๋Š”์ง€ ์ฐพ๋Š” ๋ฌธ์ œ๋‹ค.

์ผ๋ฐ˜์ ์ธ ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ triplet ํ˜•ํƒœ๋‹ค.

  • <person, ride, bicycle>
  • <person, hold, cup>
  • <person, eat, sandwich>
  • <person, sit on, chair>
  • <person, carry, backpack>

๊ธฐ์กด HOI Detection์€ ๋Œ€์ฒด๋กœ ๋‹ค์Œ ํ๋ฆ„์„ ๋”ฐ๋ฅธ๋‹ค.

  1. ์ด๋ฏธ์ง€์—์„œ ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด๋ฅผ ์ฐพ๋Š”๋‹ค.
  2. ๊ฐ€๋Šฅํ•œ human-object pair๋ฅผ ๋งŒ๋“ ๋‹ค.
  3. ๊ฐ pair๊ฐ€ ์–ด๋–ค interaction verb๋ฅผ ๊ฐ€์ง€๋Š”์ง€ ์˜ˆ์ธกํ•œ๋‹ค.
  4. ์ตœ์ข…์ ์œผ๋กœ <human, verb, object> triplet์„ ์ถœ๋ ฅํ•œ๋‹ค.

ํ•˜์ง€๋งŒ GroupHOI๋Š” ์ œ๋ชฉ ๊ทธ๋Œ€๋กœ ์—ฌ๊ธฐ์„œ ํ•œ ๊ฑธ์Œ ๋” ๋‚˜์•„๊ฐ„๋‹ค.

โ€œ์‹ค์ œ ์ด๋ฏธ์ง€ ์† ์ƒํ˜ธ์ž‘์šฉ์€ ์ •๋ง ๋…๋ฆฝ์ ์ธ pair๋“ค์˜ ์ง‘ํ•ฉ์ผ๊นŒ?โ€

์‚ฌ๋žŒ์ด ์‹ํƒ์—์„œ ๋ฐฅ์„ ๋จน๋Š” ์žฅ๋ฉด์„ ์ƒ๊ฐํ•ด๋ณด์ž.

  • ์‚ฌ๋žŒ์ด ์ˆŸ๊ฐ€๋ฝ์„ ์žก๊ณ  ์žˆ๊ณ 
  • ๊ทธ๋ฆ‡์„ ๋ณด๊ณ  ์žˆ์œผ๋ฉฐ
  • ์Œ์‹๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•˜๊ณ 
  • ํ…Œ์ด๋ธ” ์œ„์—๋Š” ์ปต, ์ ‘์‹œ, ํฌํฌ๊ฐ€ ํ•จ๊ป˜ ์žˆ๋‹ค

์ด ์žฅ๋ฉด์—์„œ ๊ฐ๊ฐ์˜ interaction์„ ์™„์ „ํžˆ ๋…๋ฆฝ์ ์ธ pair๋กœ๋งŒ ๋ณด๋ฉด, ์žฅ๋ฉด ์ „์ฒด์˜ ๋ฌธ๋งฅ์„ ๋†“์น  ์ˆ˜ ์žˆ๋‹ค.

GroupHOI์˜ ํ•ต์‹ฌ ๊ด€์ ์€ ๋ฐ”๋กœ ์ด๊ฒƒ์ด๋‹ค.

HOI๋Š” pair๋“ค์˜ ๋‚˜์—ด์ด ์•„๋‹ˆ๋ผ, ์žฅ๋ฉด ์•ˆ์—์„œ ์˜๋ฏธ ์žˆ๊ฒŒ ๋ฌถ์ธ interaction group์œผ๋กœ ์ดํ•ดํ•ด์•ผ ํ•œ๋‹ค.


๐Ÿ” HOI Detection์ด ์–ด๋ ค์šด ์ด์œ !

1. ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด๊ฐ€ ์žˆ๋‹ค๊ณ  interaction์ด ํ•ญ์ƒ ์žˆ๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค

์ด๋ฏธ์ง€ ์•ˆ์— ์‚ฌ๋žŒ๊ณผ ์ž์ „๊ฑฐ๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•ด์„œ ๋ฌด์กฐ๊ฑด ride bicycle์€ ์•„๋‹ˆ๋‹ค.

์‚ฌ๋žŒ์€ ์ž์ „๊ฑฐ๋ฅผ ํƒˆ ์ˆ˜๋„ ์žˆ๊ณ , ๋Œ ์ˆ˜๋„ ์žˆ๊ณ , ๊ณ ์น  ์ˆ˜๋„ ์žˆ๊ณ , ๊ทธ๋ƒฅ ์˜†์— ์„œ ์žˆ์„ ์ˆ˜๋„ ์žˆ๋‹ค.

์ฆ‰, HOI Detection์€ ๋‹จ์ˆœ object detection๋ณด๋‹ค ํ›จ์”ฌ ์–ด๋ ต๋‹ค.

  • ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด์˜ ์œ„์น˜
  • ์‚ฌ๋žŒ์˜ ์ž์„ธ
  • ๊ฐ์ฒด์˜ ์ข…๋ฅ˜
  • ์†๊ณผ ๊ฐ์ฒด์˜ ์ ‘์ด‰ ์—ฌ๋ถ€
  • ์ฃผ๋ณ€ ์žฅ๋ฉด ๋ฌธ๋งฅ
  • ๋‹ค๋ฅธ ๊ฐ์ฒด๋“ค๊ณผ์˜ ๊ด€๊ณ„

๋ฅผ ํ•จ๊ป˜ ๋ด์•ผ ํ•œ๋‹ค.


2. Pair ๋‹จ์œ„ ๋ชจ๋ธ๋ง์€ context๋ฅผ ์žƒ๊ธฐ ์‰ฝ๋‹ค

๊ธฐ์กด ๋ฐฉ์‹์—์„œ๋Š” ๋ณดํ†ต human-object pair๋ฅผ ํ•˜๋‚˜์”ฉ ๋ถ„๋ฆฌํ•ด์„œ ๋ณธ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด ์ด๋ฏธ์ง€ ์•ˆ์— ์‚ฌ๋žŒ์ด 3๋ช…, ๊ฐ์ฒด๊ฐ€ 10๊ฐœ ์žˆ๋‹ค๋ฉด ๊ฐ€๋Šฅํ•œ pair๋Š” 30๊ฐœ๊ฐ€ ๋œ๋‹ค.

๊ฐ pair๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ๋ถ„๋ฅ˜ํ•˜๋ฉด ๊ตฌ์กฐ๋Š” ๋‹จ์ˆœํ•˜์ง€๋งŒ, ๋‹ค์Œ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค.

  • ์„œ๋กœ ๊ด€๋ จ๋œ interaction๋“ค์„ ํ•จ๊ป˜ ์ดํ•ดํ•˜๊ธฐ ์–ด๋ ต๋‹ค.
  • ํ•œ ์žฅ๋ฉด ์•ˆ์˜ ํ™œ๋™ ๋งฅ๋ฝ์„ ์ถฉ๋ถ„ํžˆ ๋ฐ˜์˜ํ•˜๊ธฐ ์–ด๋ ต๋‹ค.
  • ๊ฐ™์€ ์‚ฌ๋žŒ์„ ์ค‘์‹ฌ์œผ๋กœ ์—ฌ๋Ÿฌ ๊ฐ์ฒด๊ฐ€ ์—ฐ๊ฒฐ๋œ ์ƒํ™ฉ์„ ์ž˜ ๋‹ค๋ฃจ๊ธฐ ์–ด๋ ต๋‹ค.
  • ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์ด ๊ฐ™์€ ๊ฐ์ฒด ๋˜๋Š” ๊ฐ™์€ ํ™œ๋™์— ์ฐธ์—ฌํ•˜๋Š” ์žฅ๋ฉด์„ ์ดํ•ดํ•˜๊ธฐ ์–ด๋ ต๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด ์•ผ๊ตฌ ์žฅ๋ฉด์—์„œ๋Š” ์‚ฌ๋žŒ, ๋ฐฐํŠธ, ๊ณต, ๊ธ€๋Ÿฌ๋ธŒ, ๋ฒ ์ด์Šค๊ฐ€ ํ•จ๊ป˜ ์˜๋ฏธ๋ฅผ ๋งŒ๋“ ๋‹ค.
์ด๋•Œ <person, hold, bat>๋งŒ ๋”ฐ๋กœ ๋ณด๊ฑฐ๋‚˜ <person, hit, ball>๋งŒ ๋”ฐ๋กœ ๋ณด๋ฉด, ์ „์ฒด action scene์„ ์ถฉ๋ถ„ํžˆ ์ดํ•ดํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ๋‹ค.


3. HOI๋Š” ๋ณธ์งˆ์ ์œผ๋กœ compositionalํ•˜๋‹ค

HOI๋Š” ๋ณดํ†ต verb + object ์กฐํ•ฉ์œผ๋กœ ์ •์˜๋œ๋‹ค.

์˜ˆ๋ฅผ ๋“ค๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • hold cup
  • hold phone
  • ride bicycle
  • ride horse
  • eat apple
  • eat sandwich
  • sit on chair
  • sit on bench

์ด์ฒ˜๋Ÿผ verb์™€ object๊ฐ€ ์กฐํ•ฉ๋˜๊ธฐ ๋•Œ๋ฌธ์— category ์ˆ˜๊ฐ€ ๋งŽ๊ณ , long-tail ๋ฌธ์ œ๊ฐ€ ์‹ฌํ•˜๋‹ค.

์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” interaction์€ ํ•™์Šตํ•˜๊ธฐ ์‰ฝ์ง€๋งŒ, ๋“œ๋ฌธ interaction์€ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค.

๊ทธ๋ž˜์„œ ๋‹จ์ˆœํžˆ ๊ฐ pair์— ๋Œ€ํ•ด class label์„ ์™ธ์šฐ๋Š” ๋ฐฉ์‹์œผ๋กœ๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ๋‹ค.


๐Ÿง  GroupHOI์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด

GroupHOI๋Š” ์ œ๋ชฉ ๊ทธ๋Œ€๋กœ Human-Object Interaction์„ Groups๋กœ ํ•™์Šตํ•œ๋‹ค.

์—ฌ๊ธฐ์„œ group์€ ๋‹จ์ˆœํžˆ ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด ํ•˜๋‚˜์˜ pair๋งŒ ์˜๋ฏธํ•˜์ง€ ์•Š๋Š”๋‹ค.
์ด๋ฏธ์ง€ ์•ˆ์—์„œ ์ƒํ˜ธ์ž‘์šฉ์ ์œผ๋กœ ์—ฐ๊ฒฐ๋œ ์‚ฌ๋žŒ, ๊ฐ์ฒด, ํ–‰๋™ ๋‹จ์„œ๋“ค์„ ํ•˜๋‚˜์˜ ๊ตฌ์กฐ๋กœ ๋ฐ”๋ผ๋ณด๋Š” ๊ด€์ ์ด๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด,

  • ํ•œ ์‚ฌ๋žŒ์ด ์—ฌ๋Ÿฌ ๊ฐ์ฒด์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š” ๊ฒฝ์šฐ
  • ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์ด ํ•˜๋‚˜์˜ ๊ฐ์ฒด์™€ ๊ด€๋ จ๋˜๋Š” ๊ฒฝ์šฐ
  • ํ•˜๋‚˜์˜ ํ™œ๋™ ์žฅ๋ฉด ์•ˆ์—์„œ ์—ฌ๋Ÿฌ interaction์ด ๋™์‹œ์— ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ์šฐ
  • ์ฃผ๋ณ€ ๊ฐ์ฒด๋“ค์ด ํŠน์ • action์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ๋ฌธ๋งฅ์„ ์ œ๊ณตํ•˜๋Š” ๊ฒฝ์šฐ

๋ฅผ group ๋‹จ์œ„๋กœ ๋‹ค๋ฃจ๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ์ด๋‹ค.

์ฆ‰, GroupHOI๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉํ–ฅ์„ ์ง€ํ–ฅํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๊ฐœ๋ณ„ pair classification์—์„œ ๋ฒ—์–ด๋‚˜, ์žฅ๋ฉด ์•ˆ์˜ interaction structure๋ฅผ group representation์œผ๋กœ ํ•™์Šตํ•˜์ž.


๐Ÿ–ผ๏ธ Pair ์ค‘์‹ฌ HOI์™€ Group ์ค‘์‹ฌ HOI์˜ ์ฐจ์ด

๊ธฐ์กด pair ์ค‘์‹ฌ ๊ด€์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

1
2
3
person 1 + object A โ†’ interaction 1
person 1 + object B โ†’ interaction 2
person 2 + object C โ†’ interaction 3

์ด ๋ฐฉ์‹์€ ์ดํ•ดํ•˜๊ธฐ ์‰ฝ์ง€๋งŒ, ๊ฐ interaction์ด ์„œ๋กœ ๋…๋ฆฝ์ ์œผ๋กœ ์ฒ˜๋ฆฌ๋˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค.

๋ฐ˜๋ฉด group ์ค‘์‹ฌ ๊ด€์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

1
2
3
4
5
6
7
8
9
10
11
12
13
interaction group 1
  - person 1
  - object A
  - object B
  - action context
  - spatial relation
  - scene cue

interaction group 2
  - person 2
  - object C
  - nearby objects
  - action context

์ด๋ ‡๊ฒŒ ๋ณด๋ฉด ๋ชจ๋ธ์€ ๋‹จ์ˆœํžˆ pair ํ•˜๋‚˜๋งŒ ๋ณด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ,
interaction์ด ๋ฐœ์ƒํ•˜๋Š” ์ฃผ๋ณ€ ๊ตฌ์กฐ์™€ context๋ฅผ ํ•จ๊ป˜ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค.


๐Ÿ” ์™œ Group์œผ๋กœ ๋ณด๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•œ๊ฐ€?

1. ๊ฐ™์€ ์‚ฌ๋žŒ์€ ์—ฌ๋Ÿฌ ๊ฐ์ฒด์™€ ๋™์‹œ์— ์ƒํ˜ธ์ž‘์šฉํ•  ์ˆ˜ ์žˆ๋‹ค

์˜ˆ๋ฅผ ๋“ค์–ด ์ฃผ๋ฐฉ์—์„œ ์š”๋ฆฌํ•˜๋Š” ์‚ฌ๋žŒ์€ ๋™์‹œ์— ์—ฌ๋Ÿฌ ๊ฐ์ฒด์™€ ๊ด€๋ จ๋œ๋‹ค.

  • ์นผ์„ ์žก๊ณ  ์žˆ์Œ
  • ๋„๋งˆ ์œ„์˜ ์Œ์‹์„ ์ž๋ฆ„
  • ๊ทธ๋ฆ‡์„ ์‚ฌ์šฉํ•จ
  • ์‹ฑํฌ๋Œ€ ์•ž์— ์„œ ์žˆ์Œ

์ด๋•Œ interaction์€ ํ•˜๋‚˜์˜ pair๋กœ ๋๋‚˜์ง€ ์•Š๋Š”๋‹ค.

person-knife, person-food, person-cutting board, person-bowl ๊ฐ™์€ ์—ฌ๋Ÿฌ ๊ด€๊ณ„๊ฐ€ ํ•˜๋‚˜์˜ ํ™œ๋™ group์„ ํ˜•์„ฑํ•œ๋‹ค.

์ด๋Ÿฐ group ๊ตฌ์กฐ๋ฅผ ์ดํ•ดํ•˜๋ฉด ๊ฐœ๋ณ„ interaction ์˜ˆ์ธก๋„ ๋” ์•ˆ์ •์ ์œผ๋กœ ํ•  ์ˆ˜ ์žˆ๋‹ค.


2. ์ฃผ๋ณ€ ๊ฐ์ฒด๊ฐ€ action์„ ํ•ด์„ํ•˜๋Š” ๋‹จ์„œ๊ฐ€ ๋œ๋‹ค

์‚ฌ๋žŒ์ด ์†์„ ์•ž์œผ๋กœ ๋ป—๊ณ  ์žˆ๋Š” ์žฅ๋ฉด์ด ์žˆ๋‹ค๊ณ  ํ•˜์ž.

๊ทธ ์•ž์— ์žˆ๋Š” ๊ฐ์ฒด๊ฐ€ ๋ฌด์—‡์ธ์ง€์— ๋”ฐ๋ผ action ํ•ด์„์€ ๋‹ฌ๋ผ์ง„๋‹ค.

  • ์ปต์ด ์žˆ์œผ๋ฉด hold cup ๋˜๋Š” drink from cup
  • ๋ฌธ ์†์žก์ด๊ฐ€ ์žˆ์œผ๋ฉด open door
  • ํ‚ค๋ณด๋“œ๊ฐ€ ์žˆ์œผ๋ฉด type on keyboard
  • ๊ณต์ด ์žˆ์œผ๋ฉด throw ball ๋˜๋Š” catch ball

์ฆ‰, action์€ ์‚ฌ๋žŒ์˜ pose๋งŒ์œผ๋กœ ๊ฒฐ์ •๋˜์ง€ ์•Š๊ณ , ์ฃผ๋ณ€ ๊ฐ์ฒด ๋ฐ ์žฅ๋ฉด ๋ฌธ๋งฅ๊ณผ ํ•จ๊ป˜ ๊ฒฐ์ •๋œ๋‹ค.

GroupHOI๋Š” ์ด๋Ÿฐ ๋งฅ๋ฝ์„ group ๋‹จ์œ„๋กœ ๋ณด๋ ค๋Š” ์ ‘๊ทผ์ด๋‹ค.


3. ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์˜ ๊ณต๋™ ํ™œ๋™์„ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค

HOI Detection์—์„œ๋Š” ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์ด ํ•จ๊ป˜ ํ•˜๋‚˜์˜ ๊ฐ์ฒด๋‚˜ ํ™œ๋™์— ์ฐธ์—ฌํ•˜๋Š” ์žฅ๋ฉด๋„ ์ค‘์š”ํ•˜๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด,

  • ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์ด ๋ณดํŠธ๋ฅผ ํƒ€๋Š” ์žฅ๋ฉด
  • ๋‘ ์‚ฌ๋žŒ์ด ํ…Œ๋‹ˆ์Šค๋ฅผ ์น˜๋Š” ์žฅ๋ฉด
  • ์‚ฌ๋žŒ๋“ค์ด ์‹ํƒ์—์„œ ํ•จ๊ป˜ ์‹์‚ฌํ•˜๋Š” ์žฅ๋ฉด
  • ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์ด ํฐ ๋ฌผ์ฒด๋ฅผ ์˜ฎ๊ธฐ๋Š” ์žฅ๋ฉด

์ด๋Ÿฐ ๊ฒฝ์šฐ ๊ฐ๊ฐ์˜ human-object pair๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ๋งŒ ๋ณด๋ฉด, ๊ณต๋™ ํ™œ๋™์˜ ๋งฅ๋ฝ์„ ๋†“์น  ์ˆ˜ ์žˆ๋‹ค.

Group ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋ง์€ ์ด๋Ÿฐ multi-person, multi-object interaction์„ ๋” ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.


๐Ÿ” GitHub README์—์„œ ํ™•์ธ๋˜๋Š” ๊ตฌํ˜„ ์ •๋ณด

์ œ๊ณต๋œ GitHub README ๊ธฐ์ค€์œผ๋กœ GroupHOI ๊ตฌํ˜„์€ ๋‹ค์Œ ์š”์†Œ๋“ค์„ ํฌํ•จํ•œ๋‹ค.

1. ์‚ฌ์šฉ ๋ฐ์ดํ„ฐ์…‹

README์—์„œ ๊ณต์‹์ ์œผ๋กœ ์–ธ๊ธ‰๋˜๋Š” ๋ฐ์ดํ„ฐ์…‹์€ ๋‹ค์Œ ๋‘ ๊ฐ€์ง€๋‹ค.

  • HICO-DET
  • V-COCO

HICO-DET๋Š” HOI Detection์—์„œ ๊ฐ€์žฅ ๋„๋ฆฌ ์“ฐ์ด๋Š” ๋ฒค์น˜๋งˆํฌ ์ค‘ ํ•˜๋‚˜๋‹ค.
V-COCO ์—ญ์‹œ ์‚ฌ๋žŒ์˜ ํ–‰๋™๊ณผ ๊ฐ์ฒด ๊ฐ„ ์ƒํ˜ธ์ž‘์šฉ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋Œ€ํ‘œ์ ์ธ ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค.


2. Backbone ๋ฐ detector ์ดˆ๊ธฐํ™”

README์—์„œ๋Š” DETR detector์˜ pretrained model์„ ๋‹ค์šด๋กœ๋“œํ•ด์„œ ์‚ฌ์šฉํ•˜๋„๋ก ์•ˆ๋‚ดํ•œ๋‹ค.

  • DETR ResNet-50 pretrained model
  • detr-r50-e632da11.pth
  • num_queries 64

HICO-DET์™€ V-COCO์šฉ์œผ๋กœ ๊ฐ๊ฐ parameter conversion์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

  • detr-r50-pre-2branch-hico.pth
  • detr-r50-pre-2branch-vcoco.pth

์ฆ‰, GroupHOI๋Š” DETR ๊ณ„์—ด์˜ object detection / query ๊ธฐ๋ฐ˜ ๊ตฌ์กฐ ์œ„์—์„œ HOI detection์„ ์ˆ˜ํ–‰ํ•˜๋Š” ํ๋ฆ„์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค.


3. CLIP ์‚ฌ์šฉ

README์—๋Š” ๋‹ค์Œ ์„ค์น˜ ๊ณผ์ •์ด ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.

1
git clone https://github.com/openai/CLIP.git && cd CLIP && python setup.py develop && cd ..

๋˜ํ•œ evaluation ๋ช…๋ น์–ด์—์„œ๋„ ๋‹ค์Œ ์˜ต์…˜๋“ค์ด ๋ณด์ธ๋‹ค.

  • --with_clip_label
  • --with_obj_clip_label

์ด๋กœ ๋ณด์•„ GroupHOI๋Š” HOI label ๋˜๋Š” object label ์ชฝ์— CLIP ๊ธฐ๋ฐ˜ semantic ์ •๋ณด๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊ตฌํ˜„์„ ํฌํ•จํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.

๋‹ค๋งŒ README๋งŒ์œผ๋กœ๋Š” CLIP์ด ์ •ํ™•ํžˆ ์–ด๋–ค loss๋‚˜ module์—์„œ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉ๋˜๋Š”์ง€๊นŒ์ง€๋Š” ์•Œ ์ˆ˜ ์—†๋‹ค.
์ •ํ™•ํ•œ ์„ธ๋ถ€ ๊ตฌ์กฐ๋Š” ๋…ผ๋ฌธ ๋ณธ๋ฌธ ๋˜๋Š” ์ฝ”๋“œ์˜ model ํŒŒ์ผ์„ ํ•จ๊ป˜ ํ™•์ธํ•ด์•ผ ํ•œ๋‹ค.


4. ์ฐธ๊ณ ํ•œ ๊ธฐ์กด ์ฝ”๋“œ๋ฒ ์ด์Šค

README์˜ Acknowledge์— ๋”ฐ๋ฅด๋ฉด ์ผ๋ถ€ ์ฝ”๋“œ๋Š” ๋‹ค์Œ ์—ฐ๊ตฌ/์ €์žฅ์†Œ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ๋‹ค.

  • PPDM
  • DETR
  • QPIC
  • CDN
  • GEN-VLKT

์ด ๋ชฉ๋ก๋งŒ ๋ด๋„ GroupHOI๊ฐ€ HOI Detection์˜ ๊ธฐ์กด ์ฃผ์š” ํ๋ฆ„, ํŠนํžˆ DETR/query ๊ธฐ๋ฐ˜ HOI detector ๊ณ„์—ด๊ณผ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.


๐Ÿ” ๊ธฐ์กด HOI Detection ๋ฐฉ์‹์˜ ํ•œ๊ณ„

1. Human-object pair explosion ๋ฌธ์ œ

์ด๋ฏธ์ง€์— ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก ๊ฐ€๋Šฅํ•œ pair ์ˆ˜๋Š” ๋น ๋ฅด๊ฒŒ ์ฆ๊ฐ€ํ•œ๋‹ค.

์‚ฌ๋žŒ์ด N๋ช…, ๊ฐ์ฒด๊ฐ€ M๊ฐœ๋ผ๋ฉด ๊ฐ€๋Šฅํ•œ human-object pair๋Š” ๋Œ€๋žต N ร— M๊ฐœ๊ฐ€ ๋œ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด,

  • ์‚ฌ๋žŒ 5๋ช…
  • ๊ฐ์ฒด 20๊ฐœ

์ด๋ฉด ๊ฐ€๋Šฅํ•œ pair๋Š” 100๊ฐœ๊ฐ€ ๋œ๋‹ค.

์ด ์ค‘ ์‹ค์ œ interaction์ด ์žˆ๋Š” pair๋Š” ์ผ๋ถ€์— ๋ถˆ๊ณผํ•˜๋‹ค.

๊ทธ๋ž˜์„œ ๋ชจ๋ธ์€ ๋งŽ์€ negative pair ์†์—์„œ ์ง„์งœ interaction์„ ์ฐพ์•„์•ผ ํ•œ๋‹ค.

Group ๊ธฐ๋ฐ˜ ์ ‘๊ทผ์€ ๋ชจ๋“  pair๋ฅผ ๋ฌด์ž‘์ • ๋…๋ฆฝ์ ์œผ๋กœ ๋ณด๋Š” ๋Œ€์‹ ,
์˜๋ฏธ ์žˆ๋Š” interaction ํ›„๋ณด๋“ค์„ ๊ตฌ์กฐ์ ์œผ๋กœ ๋ฌถ์–ด ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ๋‹ค.


2. Pair-level feature๋งŒ์œผ๋กœ๋Š” ์žฅ๋ฉด ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค

Pair-level feature๋Š” ๋ณดํ†ต ๋‹ค์Œ ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

  • human box feature
  • object box feature
  • union box feature
  • spatial encoding
  • object class
  • verb classifier

ํ•˜์ง€๋งŒ ์‹ค์ œ interaction์„ ์ดํ•ดํ•˜๋ ค๋ฉด ๋•Œ๋กœ๋Š” pair ๋ฐ”๊นฅ์˜ ์ •๋ณด๊ฐ€ ์ค‘์š”ํ•˜๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด eat์„ ํŒ๋‹จํ•˜๋ ค๋ฉด,

  • ์‚ฌ๋žŒ์˜ ์ž… ์ฃผ๋ณ€
  • ์†์˜ ์œ„์น˜
  • ์Œ์‹
  • ์ ‘์‹œ
  • ์‹ํƒ
  • ์ฃผ๋ณ€ ์‹์‚ฌ ์žฅ๋ฉด

์ด ๋ชจ๋‘๊ฐ€ ํžŒํŠธ๊ฐ€ ๋  ์ˆ˜ ์žˆ๋‹ค.

๋”ฐ๋ผ์„œ group-level context๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด interaction reasoning์— ๋„์›€์ด ๋  ์ˆ˜ ์žˆ๋‹ค.


3. Long-tail interaction์— ์•ฝํ•˜๋‹ค

HOI dataset์€ long-tail ๋ถ„ํฌ๊ฐ€ ์‹ฌํ•˜๋‹ค.

์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” interaction์€ ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์ง€๋งŒ, rare interaction์€ ๋งค์šฐ ์ ๋‹ค.

Pair ๋‹จ์œ„ classifier๋Š” rare class์˜ decision boundary๋ฅผ ์ถฉ๋ถ„ํžˆ ๋ฐฐ์šฐ๊ธฐ ์–ด๋ ต๋‹ค.

GroupHOI์ฒ˜๋Ÿผ ๊ตฌ์กฐ์  context๋ฅผ ํ™œ์šฉํ•˜๋ฉด, rare interaction๋„ ์œ ์‚ฌํ•œ group pattern์ด๋‚˜ scene context๋ฅผ ํ†ตํ•ด ๋” ์ž˜ ์ผ๋ฐ˜ํ™”ํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๋‹ค.


๐Ÿงญ GroupHOI๋Š” ์–ด๋–ค ๋ฐฉํ–ฅ์˜ ์—ฐ๊ตฌ์ธ๊ฐ€?

GroupHOI๋Š” HOI Detection์„ ๋‹ค์Œ์ฒ˜๋Ÿผ ์žฌํ•ด์„ํ•˜๋Š” ์—ฐ๊ตฌ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๊ด€์ ๊ธฐ์กด Pair ์ค‘์‹ฌ HOIGroupHOI ๊ด€์ 
๊ธฐ๋ณธ ๋‹จ์œ„Human-object pairHuman-object interaction group
๋ชจ๋ธ๋ง ๋ฐฉ์‹๊ฐ pair๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ๋ถ„๋ฅ˜๊ด€๋ จ๋œ ์‚ฌ๋žŒ, ๊ฐ์ฒด, ๋ฌธ๋งฅ์„ ํ•จ๊ป˜ ํ•™์Šต
๊ฐ•์ ๊ตฌ์กฐ๊ฐ€ ๋‹จ์ˆœํ•˜๊ณ  ์ง๊ด€์ ๋ณต์žกํ•œ ์žฅ๋ฉด context๋ฅผ ๋ฐ˜์˜ํ•˜๊ธฐ ์ข‹์Œ
์•ฝ์ pair explosion, context ๋ถ€์กฑgroup ์ •์˜์™€ ํ•™์Šต ์„ค๊ณ„๊ฐ€ ์ค‘์š”
๋ชฉํ‘œ<human, verb, object> ์˜ˆ์ธกgroup-aware interaction representation ํ•™์Šต

ํ•ต์‹ฌ์€ โ€œpair๋ฅผ ๋ฒ„๋ฆฐ๋‹คโ€๊ฐ€ ์•„๋‹ˆ๋‹ค.

์ตœ์ข… ์ถœ๋ ฅ์€ ์—ฌ์ „ํžˆ <human, verb, object> triplet์ผ ์ˆ˜ ์žˆ๋‹ค.
๋‹ค๋งŒ ๊ทธ triplet์„ ์˜ˆ์ธกํ•˜๋Š” ๊ณผ์ •์—์„œ pair ํ•˜๋‚˜๋งŒ ๊ณ ๋ฆฝ์ ์œผ๋กœ ๋ณด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ,
interaction group์˜ ๋งฅ๋ฝ ์†์—์„œ ์ดํ•ดํ•˜์ž๋Š” ๊ฒƒ์ด๋‹ค.


๐Ÿ” ๋ณธ ์—ฐ๊ตฌ์˜ ๋ฐฉ๋ฒ•๋ก ์„ ์ดํ•ดํ•˜๋Š” ํฌ์ธํŠธ

README๋งŒ์œผ๋กœ๋Š” ๋…ผ๋ฌธ ๋‚ด๋ถ€์˜ ์„ธ๋ถ€ module ์ด๋ฆ„์ด๋‚˜ loss ๊ตฌ์„ฑ๊นŒ์ง€๋Š” ํ™•์ธํ•˜๊ธฐ ์–ด๋ ต๋‹ค.
ํ•˜์ง€๋งŒ ์ œ๋ชฉ, ์ €์žฅ์†Œ ๊ตฌ์กฐ, ์‹คํ–‰ ์˜ต์…˜, ์‚ฌ์šฉ ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ์ค€์œผ๋กœ ๋ณด๋ฉด GroupHOI์—์„œ ์ค‘์š”ํ•˜๊ฒŒ ๋ด์•ผ ํ•  ๋ฐฉ๋ฒ•๋ก ์  ํฌ์ธํŠธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

1. Interaction group์„ ์–ด๋–ป๊ฒŒ ์ •์˜ํ•˜๋Š”๊ฐ€?

๊ฐ€์žฅ ์ค‘์š”ํ•œ ์งˆ๋ฌธ์€ ์ด๊ฒƒ์ด๋‹ค.

โ€œ๋ฌด์—‡์„ ํ•˜๋‚˜์˜ group์œผ๋กœ ๋ณผ ๊ฒƒ์ธ๊ฐ€?โ€

๊ฐ€๋Šฅํ•œ ๊ธฐ์ค€์€ ์—ฌ๋Ÿฌ ๊ฐ€์ง€๋‹ค.

  • ๊ฐ™์€ ์‚ฌ๋žŒ์„ ์ค‘์‹ฌ์œผ๋กœ ์—ฐ๊ฒฐ๋œ ๊ฐ์ฒด๋“ค
  • ๊ฐ™์€ ๊ฐ์ฒด์™€ ๊ด€๋ จ๋œ ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ๋“ค
  • ํ•˜๋‚˜์˜ activity context ์•ˆ์— ์žˆ๋Š” ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด๋“ค
  • spatially closeํ•œ human-object ๊ด€๊ณ„
  • semanticํ•˜๊ฒŒ ์—ฐ๊ด€๋œ interaction ํ›„๋ณด๋“ค

Group ์ •์˜๊ฐ€ ์ข‹์•„์•ผ ๋ชจ๋ธ์ด ์‹ค์ œ interaction ๊ตฌ์กฐ๋ฅผ ์ž˜ ๋ฐฐ์šธ ์ˆ˜ ์žˆ๋‹ค.


2. Group feature๋ฅผ ์–ด๋–ป๊ฒŒ ๋งŒ๋“œ๋Š”๊ฐ€?

Group์„ ์ •์˜ํ–ˆ๋‹ค๋ฉด, ๊ทธ๋‹ค์Œ์€ group representation์ด๋‹ค.

๋‹จ์ˆœํžˆ box feature๋ฅผ ํ‰๊ท  ๋‚ด๋Š” ๊ฒƒ๋งŒ์œผ๋กœ๋Š” ๋ถ€์กฑํ•  ์ˆ˜ ์žˆ๋‹ค.

์ข‹์€ group feature๋Š” ๋‹ค์Œ ์ •๋ณด๋ฅผ ๋‹ด์•„์•ผ ํ•œ๋‹ค.

  • ์‚ฌ๋žŒ์˜ appearance
  • ๊ฐ์ฒด์˜ appearance
  • ์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด์˜ ์ƒ๋Œ€ ์œ„์น˜
  • pose ๋˜๋Š” motion cue
  • scene context
  • ๋‹ค๋ฅธ ๊ฐ์ฒด์™€์˜ ๊ด€๊ณ„
  • group ๋‚ด๋ถ€์˜ interaction consistency

์ฆ‰, group feature๋Š” pair feature๋ณด๋‹ค ๋” ํ’๋ถ€ํ•œ context๋ฅผ ๋‹ด๋Š” representation์ด์–ด์•ผ ํ•œ๋‹ค.


3. Group๊ณผ triplet prediction์„ ์–ด๋–ป๊ฒŒ ์—ฐ๊ฒฐํ•˜๋Š”๊ฐ€?

HOI Detection์˜ ์ตœ์ข… ์ถœ๋ ฅ์€ ๋Œ€์ฒด๋กœ <human, verb, object> triplet์ด๋‹ค.

๊ทธ๋ ‡๋‹ค๋ฉด group-level representation์„ ์ตœ์ข… triplet prediction๊ณผ ์–ด๋–ป๊ฒŒ ์—ฐ๊ฒฐํ• ์ง€๊ฐ€ ์ค‘์š”ํ•˜๋‹ค.

๊ฐ€๋Šฅํ•œ ํ๋ฆ„์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. ์ด๋ฏธ์ง€์—์„œ candidate human/object๋ฅผ ์ฐพ๋Š”๋‹ค.
  2. ๊ด€๋ จ ์žˆ๋Š” human-object ํ›„๋ณด๋“ค์„ group์œผ๋กœ ๋ฌถ๋Š”๋‹ค.
  3. group representation์„ ํ•™์Šตํ•œ๋‹ค.
  4. group ๋‚ด๋ถ€์—์„œ ๊ฐ human-object interaction์„ ์˜ˆ์ธกํ•œ๋‹ค.
  5. ์ตœ์ข… triplet์„ ์ถœ๋ ฅํ•œ๋‹ค.

์ด ๊ณผ์ •์—์„œ group์€ ๋‹จ์ˆœ ๋ณด์กฐ feature๊ฐ€ ์•„๋‹ˆ๋ผ, interaction reasoning์˜ ์ค‘์‹ฌ ๋‹จ์œ„๊ฐ€ ๋œ๋‹ค.


4. CLIP label ์ •๋ณด๋ฅผ ์–ด๋–ป๊ฒŒ ํ™œ์šฉํ•˜๋Š”๊ฐ€?

README์˜ evaluation command์—๋Š” --with_clip_label, --with_obj_clip_label ์˜ต์…˜์ด ๋“ฑ์žฅํ•œ๋‹ค.

์ด๋Š” GroupHOI๊ฐ€ ๋‹จ์ˆœ visual feature๋งŒ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ,
CLIP์—์„œ ์–ป์„ ์ˆ˜ ์žˆ๋Š” label-level semantic ์ •๋ณด๋ฅผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉํ–ฅ์ผ ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์ค€๋‹ค.

HOI Detection์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์œ ๋กœ CLIP semantic์ด ์œ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ride bicycle๊ณผ ride horse์ฒ˜๋Ÿผ ๋น„์Šทํ•œ action ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง„ class๋ฅผ ๊ฐ€๊น๊ฒŒ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • hold cup๊ณผ hold bottle์ฒ˜๋Ÿผ object๋Š” ๋‹ค๋ฅด์ง€๋งŒ ์œ ์‚ฌํ•œ interaction์„ ๊ณต์œ ํ•˜๋Š” class์— ๋„์›€์„ ์ค„ ์ˆ˜ ์žˆ๋‹ค.
  • rare class์˜ classifier ํ•™์Šต์„ language prior๋กœ ๋ณด์™„ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • object label๊ณผ verb label ์‚ฌ์ด์˜ ์˜๋ฏธ ๊ด€๊ณ„๋ฅผ ๋” ์ž˜ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

๋‹ค๋งŒ ๊ตฌ์ฒด์ ์ธ CLIP ์‚ฌ์šฉ ๋ฐฉ์‹์€ ๋…ผ๋ฌธ ๋ณธ๋ฌธ ๋˜๋Š” ์ฝ”๋“œ์˜ model ๊ตฌํ˜„์„ ํ™•์ธํ•ด์•ผ ์ •ํ™•ํžˆ ๋งํ•  ์ˆ˜ ์žˆ๋‹ค.


๐Ÿงฉ GroupHOI์˜ ํ•ต์‹ฌ Contribution ์ •๋ฆฌ

๊ตฌ๋ถ„๋‚ด์šฉ
๋ฐœํ‘œNeurIPS 2025
๋ฌธ์ œ ์„ค์ •Human-Object Interaction Detection
ํ•ต์‹ฌ ๊ด€์ HOI๋ฅผ ๋…๋ฆฝ์ ์ธ pair๊ฐ€ ์•„๋‹ˆ๋ผ interaction group์œผ๋กœ ํ•™์Šต
๊ตฌํ˜„ ๊ธฐ๋ฐ˜DETR R50, query ๊ธฐ๋ฐ˜ HOI detection ํ๋ฆ„
์‚ฌ์šฉ ๋ฐ์ดํ„ฐ์…‹HICO-DET, V-COCO
Semantic ์ •๋ณดREADME ๊ธฐ์ค€ CLIP label / object CLIP label ์˜ต์…˜ ์‚ฌ์šฉ
๊ธฐ๋Œ€ ์žฅ์ context-aware reasoning, group-aware interaction representation, rare interaction ์ผ๋ฐ˜ํ™”
์˜์˜HOI Detection์„ ๊ตฌ์กฐ์  ๊ด€๊ณ„ ํ•™์Šต ๋ฌธ์ œ๋กœ ๋” ํ™•์žฅ

๐Ÿงช README ๊ธฐ์ค€ ์‹คํ—˜ ๊ฒฐ๊ณผ

GitHub README์—๋Š” Regular HOI Detection Results๊ฐ€ ๊ณต๊ฐœ๋˜์–ด ์žˆ๋‹ค.

1. HICO-DET ๊ฒฐ๊ณผ

ModelFull (D)Rare (D)Non-rare (D)Full (KO)Rare (KO)Non-rare (KO)
GroupHOI-S (R50)36.7034.8637.2639.4237.7839.91

์—ฌ๊ธฐ์„œ README ๊ธฐ์ค€ ํ‘œ๊ธฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • D: Default
  • KO: Known Object

๋ˆˆ์— ๋„๋Š” ์ ์€ Rare ์„ฑ๋Šฅ๋„ ๊ฝค ๋†’๊ฒŒ ๋ณด๊ณ ๋˜์–ด ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

  • Rare (D): 34.86
  • Rare (KO): 37.78

HOI Detection์—์„œ๋Š” rare class ์„ฑ๋Šฅ์ด ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค.
๋ฐ์ดํ„ฐ๊ฐ€ ์ ์€ interaction์„ ์–ผ๋งˆ๋‚˜ ์ž˜ ์˜ˆ์ธกํ•˜๋Š”์ง€๊ฐ€ ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ฃผ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.


2. V-COCO ๊ฒฐ๊ณผ

ModelScenario 1Scenario 2
GroupHOI-S (R50)65.066.0

V-COCO์—์„œ๋„ GroupHOI-S (R50)์˜ ๊ฒฐ๊ณผ๊ฐ€ ๊ณต๊ฐœ๋˜์–ด ์žˆ๋‹ค.

V-COCO๋Š” HICO-DET์™€ ํ‰๊ฐ€ ๋ฐฉ์‹์ด ๋‹ค๋ฅด์ง€๋งŒ,
์‚ฌ๋žŒ์˜ ํ–‰๋™๊ณผ ๊ฐ์ฒด ๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ‰๊ฐ€ํ•œ๋‹ค๋Š” ์ ์—์„œ HOI detector์˜ ์„ฑ๋Šฅ์„ ๋ณด๋Š” ์ค‘์š”ํ•œ benchmark๋‹ค.


๐Ÿงช ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณผ ๋•Œ ์ฒดํฌํ•  ๋ถ€๋ถ„

GroupHOI๋ฅผ ์ฝ์„ ๋•Œ๋Š” ๋‹จ์ˆœํžˆ ์ „์ฒด mAP๋งŒ ๋ณด๋Š” ๊ฒƒ๋ณด๋‹ค ๋‹ค์Œ ํฌ์ธํŠธ๋ฅผ ํ•จ๊ป˜ ๋ณด๋ฉด ์ข‹๋‹ค.

1. Rare class ์„ฑ๋Šฅ์ด ์–ผ๋งˆ๋‚˜ ์ข‹์•„์ง€๋Š”๊ฐ€?

HOI Detection์€ long-tail ๋ฌธ์ œ๊ฐ€ ์‹ฌํ•˜๋‹ค.

๋”ฐ๋ผ์„œ ์ „์ฒด Full mAP๋ฟ ์•„๋‹ˆ๋ผ Rare mAP๊ฐ€ ์ค‘์š”ํ•˜๋‹ค.

GroupHOI README ๊ธฐ์ค€ HICO-DET ๊ฒฐ๊ณผ์—์„œ Rare ์„ฑ๋Šฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • Rare (D): 34.86
  • Rare (KO): 37.78

์ด ์ˆ˜์น˜๊ฐ€ ๊ธฐ์กด ๋ฐฉ๋ฒ• ๋Œ€๋น„ ์–ผ๋งˆ๋‚˜ ๊ฐœ์„ ๋˜์—ˆ๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  group modeling์ด rare class์— ์–ด๋–ค ์˜ํ–ฅ์„ ์ฃผ๋Š”์ง€๊ฐ€ ๋…ผ๋ฌธ์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•˜๊ฒŒ ๋ณผ ๋ถ€๋ถ„์ด๋‹ค.


2. Group modeling์ด ์‹ค์ œ๋กœ ํšจ๊ณผ๊ฐ€ ์žˆ๋Š”๊ฐ€?

Ablation study์—์„œ ๋ด์•ผ ํ•  ํ•ต์‹ฌ์€ ๋‹ค์Œ์ด๋‹ค.

  • group module์„ ์ œ๊ฑฐํ•˜๋ฉด ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง€๋Š”๊ฐ€?
  • pair-only ๋ฐฉ์‹๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ ๊ฐœ์„ ์ด ์žˆ๋Š”๊ฐ€?
  • group ํฌ๊ธฐ๋‚˜ group ๊ตฌ์„ฑ ๋ฐฉ์‹์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ๋‹ฌ๋ผ์ง€๋Š”๊ฐ€?
  • group context๊ฐ€ rare interaction์— ๋” ๋„์›€์ด ๋˜๋Š”๊ฐ€?
  • multi-person/multi-object ์žฅ๋ฉด์—์„œ ๊ฐœ์„ ์ด ํฐ๊ฐ€?

์ด๋Ÿฐ ๋ถ„์„์ด ์žˆ์–ด์•ผ โ€œgroup์œผ๋กœ ๋ณธ๋‹คโ€๋Š” ์•„์ด๋””์–ด๊ฐ€ ์‹ค์ œ๋กœ ์œ ํšจํ•˜๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.


3. CLIP label ์ •๋ณด๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋„์›€์ด ๋˜๋Š”๊ฐ€?

README ๋ช…๋ น์–ด์— CLIP ๊ด€๋ จ ์˜ต์…˜์ด ์žˆ๋Š” ๋งŒํผ, ๋‹ค์Œ ablation๋„ ์ค‘์š”ํ•˜๋‹ค.

  • with_clip_label์„ ์ œ๊ฑฐํ–ˆ์„ ๋•Œ ์„ฑ๋Šฅ ๋ณ€ํ™”
  • with_obj_clip_label์„ ์ œ๊ฑฐํ–ˆ์„ ๋•Œ ์„ฑ๋Šฅ ๋ณ€ํ™”
  • CLIP semantic์ด rare class์— ๋” ํฐ ๋„์›€์„ ์ฃผ๋Š”์ง€ ์—ฌ๋ถ€
  • object label semantic๊ณผ interaction group modeling์ด ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉ๋˜๋Š”์ง€

HOI class๋Š” verb-object ์กฐํ•ฉ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—,
CLIP์˜ language/semantic prior๊ฐ€ ์ž˜ ๋“ค์–ด๊ฐ€๋ฉด long-tail ๋ฌธ์ œ์— ๋„์›€์ด ๋  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๋‹ค.


4. ๊ณ„์‚ฐ๋Ÿ‰์€ ์–ผ๋งˆ๋‚˜ ์ฆ๊ฐ€ํ•˜๋Š”๊ฐ€?

Group modeling์€ context๋ฅผ ํ’๋ถ€ํ•˜๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ์ง€๋งŒ, ๊ณ„์‚ฐ๋Ÿ‰์ด ๋Š˜์–ด๋‚  ์ˆ˜๋„ ์žˆ๋‹ค.

๊ทธ๋ž˜์„œ ๋‹ค์Œ๋„ ์ค‘์š”ํ•˜๋‹ค.

  • inference ์†๋„
  • memory ์‚ฌ์šฉ๋Ÿ‰
  • group ์ƒ์„ฑ ๋น„์šฉ
  • pair enumeration ๋Œ€๋น„ ํšจ์œจ์„ฑ
  • DETR query ์ˆ˜์™€ ์„ฑ๋Šฅ์˜ ๊ด€๊ณ„
  • backbone์ด๋‚˜ detector์— ๋…๋ฆฝ์ ์œผ๋กœ ์ ์šฉ ๊ฐ€๋Šฅํ•œ์ง€

์ข‹์€ group-based HOI ๋ชจ๋ธ์ด๋ผ๋ฉด ์„ฑ๋Šฅ๋ฟ ์•„๋‹ˆ๋ผ ํšจ์œจ์„ฑ๋„ ํ•จ๊ป˜ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค.


๐Ÿ”ฅ ์™œ ์ด ๋…ผ๋ฌธ์ด ์ค‘์š”ํ•œ๊ฐ€?

HOI Detection์€ ๋‹จ์ˆœํžˆ ๊ฐ์ฒด๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ๊ฐ€ ์•„๋‹ˆ๋‹ค.

์‚ฌ๋žŒ์ด ๊ฐ์ฒด์™€ ์–ด๋–ค ๊ด€๊ณ„๋ฅผ ๋งบ๊ณ  ์žˆ๋Š”์ง€ ์ดํ•ดํ•ด์•ผ ํ•œ๋‹ค.

์ด๋Š” ๊ณง ์žฅ๋ฉด ์ดํ•ด(scene understanding), ํ–‰๋™ ์ธ์‹(action understanding), ๊ด€๊ณ„ ์ถ”๋ก (relation reasoning)์ด ๋ชจ๋‘ ๊ฒฐํ•ฉ๋œ ๋ฌธ์ œ๋‹ค.

GroupHOI๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ ๋Š” HOI Detection์˜ ๊ธฐ๋ณธ ๋‹จ์œ„๋ฅผ ๋‹ค์‹œ ์ƒ๊ฐํ•˜๊ฒŒ ๋งŒ๋“ค๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

๊ธฐ์กด ๋ฐฉ์‹์€ ๋Œ€๋ถ€๋ถ„ ๋‹ค์Œ์ฒ˜๋Ÿผ ์ƒ๊ฐํ–ˆ๋‹ค.

โ€œ์‚ฌ๋žŒ ํ•˜๋‚˜์™€ ๊ฐ์ฒด ํ•˜๋‚˜๋ฅผ pair๋กœ ๋งŒ๋“ค๊ณ , ๊ทธ pair์˜ interaction์„ ๋ถ„๋ฅ˜ํ•˜์ž.โ€

ํ•˜์ง€๋งŒ ์‹ค์ œ ์ด๋ฏธ์ง€๋Š” ํ›จ์”ฌ ๋ณต์žกํ•˜๋‹ค.

  • ํ•œ ์‚ฌ๋žŒ์ด ์—ฌ๋Ÿฌ ๊ฐ์ฒด์™€ ์ƒํ˜ธ์ž‘์šฉํ•œ๋‹ค.
  • ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์ด ๊ฐ™์€ ๊ฐ์ฒด์™€ ๊ด€๋ จ๋œ๋‹ค.
  • ์ฃผ๋ณ€ ๊ฐ์ฒด๋“ค์ด action์„ ํ•ด์„ํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ ๋‹จ์„œ๊ฐ€ ๋œ๋‹ค.
  • ์žฅ๋ฉด ์ „์ฒด์˜ activity context๊ฐ€ verb prediction์— ์˜ํ–ฅ์„ ์ค€๋‹ค.

๋”ฐ๋ผ์„œ group ๋‹จ์œ„๋กœ interaction์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ์ž์—ฐ์Šค๋Ÿฌ์šด ํ™•์žฅ์ด๋‹ค.

ํŠนํžˆ ๋กœ๋ณดํ‹ฑ์Šค, ์ž์œจ์ฃผํ–‰, ์˜์ƒ ์ดํ•ด, ๊ฐ์‹œ ์‹œ์Šคํ…œ, AR/VR ๊ฐ™์€ ์‘์šฉ์—์„œ๋Š” ๋‹จ์ผ pair๋ณด๋‹ค ์žฅ๋ฉด ์† ๊ด€๊ณ„ ๊ตฌ์กฐ๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.


๐Ÿง  ๊ฐœ์ธ์ ์ธ ์ดํ•ด ํฌ์ธํŠธ

์ด ๋…ผ๋ฌธ์—์„œ ๊ฐ€์žฅ ํฅ๋ฏธ๋กœ์šด ๋ถ€๋ถ„์€ HOI Detection์„ โ€œ๊ด€๊ณ„๋“ค์˜ ์ง‘ํ•ฉโ€์œผ๋กœ ๋ณด๋Š” ๊ด€์ ์ด๋‹ค.

์‚ฌ๋žŒ๊ณผ ๊ฐ์ฒด ํ•˜๋‚˜์˜ pair๋งŒ ๋ณด๋ฉด ์• ๋งคํ•œ ์žฅ๋ฉด๋„, group์œผ๋กœ ๋ณด๋ฉด ํ›จ์”ฌ ๋ช…ํ™•ํ•ด์งˆ ์ˆ˜ ์žˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด ์‚ฌ๋žŒ์ด ์†์„ ๋ป—๊ณ  ์žˆ๋Š” ์žฅ๋ฉด์—์„œ,

  • ์ปต ํ•˜๋‚˜๋งŒ ์žˆ์œผ๋ฉด hold cup
  • ์ ‘์‹œ์™€ ์Œ์‹์ด ํ•จ๊ป˜ ์žˆ์œผ๋ฉด eat
  • ์‹ฑํฌ๋Œ€์™€ ์ ‘์‹œ๊ฐ€ ์žˆ์œผ๋ฉด wash
  • ๋ฌธ ์†์žก์ด๊ฐ€ ์žˆ์œผ๋ฉด open
  • ํ‚ค๋ณด๋“œ์™€ ๋ชจ๋‹ˆํ„ฐ๊ฐ€ ์žˆ์œผ๋ฉด type

์ฒ˜๋Ÿผ ์ฃผ๋ณ€ group context๊ฐ€ action ํ•ด์„์— ํฐ ์˜ํ–ฅ์„ ์ค€๋‹ค.

์ฆ‰, interaction์€ pair ์•ˆ์—๋งŒ ์กด์žฌํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์žฅ๋ฉด ์ „์ฒด ๊ตฌ์กฐ ์†์—์„œ ์˜๋ฏธ๊ฐ€ ๊ฒฐ์ •๋œ๋‹ค.

GroupHOI๋Š” ์ด ์ ์„ ์ž˜ ํฌ์ฐฉํ•œ ์—ฐ๊ตฌ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๋˜ํ•œ README์—์„œ CLIP label ๊ด€๋ จ ์˜ต์…˜์ด ๋ณด์ธ๋‹ค๋Š” ์ ๋„ ํฅ๋ฏธ๋กญ๋‹ค.
HOI๋Š” verb + object์˜ ์กฐํ•ฉ์ด๊ธฐ ๋•Œ๋ฌธ์—, ์‹œ๊ฐ ์ •๋ณด๋ฟ ์•„๋‹ˆ๋ผ label semantic์„ ์ž˜ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์ด ํŠนํžˆ ์ค‘์š”ํ•˜๋‹ค.

Group-level context์™€ CLIP semantic์ด ํ•จ๊ป˜ ์“ฐ์ธ๋‹ค๋ฉด,
๋‹จ์ˆœ pair classifier๋ณด๋‹ค ๋” ํ’๋ถ€ํ•œ interaction representation์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.


โœ… ๊ฒฐ๋ก 

  • GroupHOI๋Š” NeurIPS 2025์— ๋ฐœํ‘œ๋œ HOI Detection ์—ฐ๊ตฌ๋‹ค.
  • ์ œ๋ชฉ์ฒ˜๋Ÿผ HOI๋ฅผ ๋…๋ฆฝ์ ์ธ pair๊ฐ€ ์•„๋‹ˆ๋ผ group ๋‹จ์œ„๋กœ ํ•™์Šตํ•˜๋ ค๋Š” ๊ด€์ ์„ ์ œ์•ˆํ•œ๋‹ค.
  • ๊ธฐ์กด pair ์ค‘์‹ฌ HOI Detection์€ ๊ตฌ์กฐ๊ฐ€ ๋‹จ์ˆœํ•˜์ง€๋งŒ, ๋ณต์žกํ•œ ์žฅ๋ฉด context๋ฅผ ๋†“์น˜๊ธฐ ์‰ฝ๋‹ค.
  • GroupHOI๋Š” ๊ด€๋ จ๋œ ์‚ฌ๋žŒ, ๊ฐ์ฒด, action context๋ฅผ ํ•˜๋‚˜์˜ interaction group์œผ๋กœ ๋ณด๊ณ  ๋” ํ’๋ถ€ํ•œ ๊ด€๊ณ„ ํ‘œํ˜„์„ ํ•™์Šตํ•˜๋ ค ํ•œ๋‹ค.
  • GitHub README ๊ธฐ์ค€ DETR R50 ๊ธฐ๋ฐ˜ pretrained detector์™€ CLIP ๊ด€๋ จ label ์˜ต์…˜์„ ์‚ฌ์šฉํ•œ๋‹ค.
  • HICO-DET์—์„œ GroupHOI-S (R50)๋Š” Default Full 36.70, Rare 34.86, Known Object Full 39.42๋ฅผ ๋ณด๊ณ ํ•œ๋‹ค.
  • V-COCO์—์„œ๋Š” Scenario 1 65.0, Scenario 2 66.0์„ ๋ณด๊ณ ํ•œ๋‹ค.
  • ํ•ต์‹ฌ์€ โ€œ์ตœ์ข… ์ถœ๋ ฅ์ด triplet์ด๋ƒ ์•„๋‹ˆ๋ƒโ€๊ฐ€ ์•„๋‹ˆ๋ผ, ๊ทธ triplet์„ ์˜ˆ์ธกํ•˜๋Š” ๊ณผ์ •์—์„œ group-aware reasoning์„ ํ•˜๋А๋ƒ์ด๋‹ค.
  • HOI Detection์„ ๋‹จ์ˆœ pair classification์—์„œ ๊ตฌ์กฐ์  scene understanding ๋ฌธ์ œ๋กœ ํ™•์žฅํ•œ๋‹ค๋Š” ์ ์—์„œ ์˜๋ฏธ ์žˆ๋Š” ๋ฐฉํ–ฅ์ด๋‹ค!!

This post is licensed under CC BY 4.0 by the author.