Post

๐Ÿ‘Ÿ SHOE: Open-Vocabulary HOI๋ฅผ ์˜๋ฏธ์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ธฐ (CVPR 2026 Workshop)

๐Ÿ‘Ÿ SHOE: Open-Vocabulary HOI๋ฅผ ์˜๋ฏธ์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ธฐ (CVPR 2026 Workshop)

๐Ÿ‘Ÿ SHOE ๋…ผ๋ฌธ ์ฝ๊ธฐ!

๋…ผ๋ฌธ: SHOE: Semantic HOI Open-Vocabulary Evaluation Metric
์ €์ž: Maja Noack, Qinqian Lei, Taipeng Tian, Bihan Dong, Robby T. Tan, Yixin Chen, John Young, Saijun Zhang, Bo Wang
ํ•™ํšŒ: CVPR 2026 Workshop on Grounding and Reasoning with Vision-Language Models (GRAIL-V)
์ฝ”๋“œ: https://github.com/majnoa/SHOE
ํ•œ ์ค„ ์š”์•ฝ: Open-vocabulary HOI ์˜ˆ์ธก์—์„œ lean on couch์™€ sit on couch์ฒ˜๋Ÿผ ํ‘œํ˜„์€ ๋‹ค๋ฅด์ง€๋งŒ ์˜๋ฏธ์ ์œผ๋กœ ๊ฐ€๊นŒ์šด ๋‹ต์„ ๋ฌด์กฐ๊ฑด ์˜ค๋‹ต ์ฒ˜๋ฆฌํ•˜์ง€ ์•Š๊ธฐ ์œ„ํ•ด, HOI label ๊ฐ„ semantic similarity๋ฅผ ๋ฐ˜์˜ํ•˜๋Š” ํ‰๊ฐ€ metric์„ ์ œ์•ˆํ•œ๋‹ค!!


๐Ÿงฉ ๋จผ์ € HOI Detection๊ณผ Open-Vocabulary ๋ฌธ์ œ๊ฐ€ ๋ญ”๊ฐ€?

HOI Detection์€ Human-Object Interaction Detection์˜ ์ค„์ž„๋ง์ด๋‹ค.

์ด๋ฏธ์ง€ ์•ˆ์—์„œ ๋‹จ์ˆœํžˆ ์‚ฌ๋žŒ๊ณผ ๋ฌผ์ฒด๋ฅผ ์ฐพ๋Š” ๊ฒƒ์„ ๋„˜์–ด์„œ,

  • ์‚ฌ๋žŒ์ด ์–ด๋–ค ๊ฐ์ฒด์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š”์ง€
  • ๊ทธ ์ƒํ˜ธ์ž‘์šฉ์ด ์–ด๋–ค ํ–‰๋™์ธ์ง€
  • ์ตœ์ข…์ ์œผ๋กœ <person, verb, object> ๊ด€๊ณ„๊ฐ€ ๋ฌด์—‡์ธ์ง€

๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฌธ์ œ๋‹ค.

์˜ˆ๋ฅผ ๋“ค๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • <person, ride, bicycle>
  • <person, hold, cup>
  • <person, sit on, couch>
  • <person, lean on, couch>
  • <person, inspect, laptop>

๊ธฐ์กด HOI benchmark์—์„œ๋Š” ๋ณดํ†ต ๋ฏธ๋ฆฌ ์ •ํ•ด์ง„ class ๋ชฉ๋ก์ด ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด HICO-DET์€ 600๊ฐœ์˜ HOI class๋ฅผ ์ •์˜ํ•ด๋‘๊ณ , ๋ชจ๋ธ์ด ๊ทธ class๋ฅผ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋งžํžˆ๋Š”์ง€ ํ‰๊ฐ€ํ•œ๋‹ค.

ํ•˜์ง€๋งŒ ์š”์ฆ˜์€ CLIP, VLM, MLLM ๊ฐ™์€ ๋ชจ๋ธ ๋•๋ถ„์— ์ƒํ™ฉ์ด ๋‹ฌ๋ผ์กŒ๋‹ค.

๋ชจ๋ธ์ด ๊ผญ ์ •ํ•ด์ง„ label๋งŒ ์ถœ๋ ฅํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์ž์—ฐ์–ด๋กœ ๋” ์ž์œ ๋กญ๊ฒŒ interaction์„ ๋งํ•  ์ˆ˜ ์žˆ๋‹ค.

์ด๊ฒƒ์ด open-vocabulary HOI detection์˜ ํ•ต์‹ฌ์ด๋‹ค.

๋ฏธ๋ฆฌ ์ •ํ•ด์ง„ 600๊ฐœ label ์•ˆ์—์„œ๋งŒ ๋งžํžˆ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์‹ค์ œ ์„ธ๊ณ„์—์„œ ๋“ฑ์žฅํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ-๊ฐ์ฒด ์ƒํ˜ธ์ž‘์šฉ์„ ๋” ๋„“๊ฒŒ ํ‘œํ˜„ํ•˜๊ณ  ํ‰๊ฐ€ํ•˜์ž!

์ข‹์€ ๋ฐฉํ–ฅ์ด์ง€๋งŒ, ์—ฌ๊ธฐ์„œ ํฐ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค.


๐Ÿšจ ๊ธฐ์กด mAP ํ‰๊ฐ€์˜ ๋ฌธ์ œ: ์˜๋ฏธ๊ฐ€ ๋น„์Šทํ•ด๋„ ํ‹€๋ ธ๋‹ค๊ณ  ๋ณธ๋‹ค!

HOI Detection์—์„œ๋Š” ๋ณดํ†ต mAP(mean Average Precision)๋ฅผ ๋งŽ์ด ์‚ฌ์šฉํ•œ๋‹ค.

mAP๋Š” detection ๋ถ„์•ผ์—์„œ ์˜ค๋žซ๋™์•ˆ ์“ฐ์ธ ํ‘œ์ค€ metric์ด๋‹ค. bounding box๊ฐ€ ๋งž๋Š”์ง€ ๋ณด๊ณ , class label์ด ๋งž๋Š”์ง€ ๋ณด๊ณ , confidence score ์ˆœ์„œ๋Œ€๋กœ precision-recall์„ ๊ณ„์‚ฐํ•œ๋‹ค.

๋ฌธ์ œ๋Š” ๊ธฐ์กด mAP๊ฐ€ HOI class๋ฅผ ์™„์ „ํžˆ ๋ถ„๋ฆฌ๋œ discrete label๋กœ ๋ณธ๋‹ค๋Š” ์ ์ด๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด ์ •๋‹ต์ด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค๊ณ  ํ•˜์ž.

  • Ground Truth: sit on couch

๊ทธ๋Ÿฐ๋ฐ ๋ชจ๋ธ์ด ์ด๋ ‡๊ฒŒ ์˜ˆ์ธกํ–ˆ๋‹ค.

  • Prediction: lean on couch

์‚ฌ๋žŒ์ด ๋ณด๊ธฐ์—๋Š” ๊ฝค ๋น„์Šทํ•œ ์ƒํ™ฉ์ผ ์ˆ˜ ์žˆ๋‹ค. ์†ŒํŒŒ์— ์•‰์•„ ๊ธฐ๋Œ€๊ณ  ์žˆ๋Š” ์žฅ๋ฉด์ด๋ผ๋ฉด ๋‘ ํ‘œํ˜„์ด ๋ชจ๋‘ ์–ด๋А ์ •๋„ ํƒ€๋‹นํ•  ์ˆ˜ ์žˆ๋‹ค.

ํ•˜์ง€๋งŒ ๊ธฐ์กด exact-match ๊ธฐ๋ฐ˜ ํ‰๊ฐ€์—์„œ๋Š” label์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ํ‹€๋ ธ๋‹ค๊ณ  ์ฒ˜๋ฆฌ๋œ๋‹ค.

๋˜ ๋‹ค๋ฅธ ์˜ˆ๋„ ์žˆ๋‹ค.

  • ride bicycle vs cycle bicycle
  • hold cup vs grasp cup
  • look at phone vs inspect phone
  • sit on chair vs rest on chair

์ด๋Ÿฐ ํ‘œํ˜„๋“ค์€ ์™„์ „ํžˆ ๊ฐ™์ง€๋Š” ์•Š์ง€๋งŒ, ์˜๋ฏธ์ ์œผ๋กœ ๊ฐ€๊นŒ์šธ ์ˆ˜ ์žˆ๋‹ค.

Open-vocabulary ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•  ๋•Œ ์ด๋Ÿฐ ์ฐจ์ด๋ฅผ ๋ชจ๋‘ 0์  ์ฒ˜๋ฆฌํ•˜๋ฉด, ๋ชจ๋ธ์˜ ์‹ค์ œ ์ดํ•ด ๋Šฅ๋ ฅ์„ ๊ณผ์†Œํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ๋‹ค.

SHOE๋Š” ๋ฐ”๋กœ ์ด ์ง€์ ์„ ๊ฒจ๋ƒฅํ•œ๋‹ค.

โ€œHOI ํ‰๊ฐ€์—์„œ label string์ด ์ •ํ™•ํžˆ ๊ฐ™์•„์•ผ๋งŒ ๋งž์•˜๋‹ค๊ณ  ๋ณผ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์‚ฌ๋žŒ์ฒ˜๋Ÿผ ์˜๋ฏธ์  ์œ ์‚ฌ๋„๋ฅผ ๋ฐ˜์˜ํ•ด์„œ ํ‰๊ฐ€ํ•˜์ž!โ€


๐Ÿ‘Ÿ SHOE์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด

SHOE๋Š” Semantic HOI Open-Vocabulary Evaluation์˜ ์•ฝ์ž๋‹ค.

ํ•ต์‹ฌ์€ ๋‹จ์ˆœํ•˜๋‹ค.

๊ธฐ์กด ํ‰๊ฐ€๊ฐ€ ์˜ˆ์ธก๊ณผ ์ •๋‹ต์„ ์ด๋ ‡๊ฒŒ ๋ดค๋‹ค๋ฉด,

1
์˜ˆ์ธก label == ์ •๋‹ต label ? 1์  : 0์ 

SHOE๋Š” ์ด๋ ‡๊ฒŒ ๋ณธ๋‹ค.

1
์˜ˆ์ธก HOI์™€ ์ •๋‹ต HOI๊ฐ€ ์˜๋ฏธ์ ์œผ๋กœ ์–ผ๋งˆ๋‚˜ ๊ฐ€๊นŒ์šด๊ฐ€?

์˜ˆ๋ฅผ ๋“ค์–ด,

  • sit on couch์™€ lean on couch๋Š” 0์ ๋ณด๋‹ค๋Š” ๋†’์€ ์ ์ˆ˜๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ๊ณ 
  • sit on couch์™€ eat apple์€ ๊ฑฐ์˜ 0์ ์— ๊ฐ€๊นŒ์›Œ์•ผ ํ•œ๋‹ค.

์ฆ‰, SHOE๋Š” binary correct/incorrect๊ฐ€ ์•„๋‹ˆ๋ผ graded semantic similarity๋ฅผ ํ‰๊ฐ€์— ๋„ฃ๋Š”๋‹ค.

์ด ๋ฐฉ์‹์€ open-vocabulary ๋ชจ๋ธ์—๊ฒŒ ํŠนํžˆ ์ค‘์š”ํ•˜๋‹ค. VLM์ด๋‚˜ MLLM์€ ๊ธฐ์กด dataset label๊ณผ ๋˜‘๊ฐ™์€ ๋‹จ์–ด๋ฅผ ์ถœ๋ ฅํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.


๐Ÿ” SHOE๋Š” HOI๋ฅผ verb์™€ object๋กœ ๋‚˜๋ˆ  ๋ณธ๋‹ค

SHOE์˜ ์ข‹์€ ์ ์€ HOI๋ฅผ ํ•˜๋‚˜์˜ ํ†ต์งœ label๋กœ๋งŒ ๋ณด์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

HOI๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ verb + object ์กฐํ•ฉ์ด๋‹ค.

์˜ˆ๋ฅผ ๋“ค๋ฉด,

  • ride bicycle = verb: ride, object: bicycle
  • sit on couch = verb: sit on, object: couch
  • hold cup = verb: hold, object: cup

SHOE๋Š” ์˜ˆ์ธก HOI์™€ ์ •๋‹ต HOI๋ฅผ ๊ฐ๊ฐ verb component์™€ object component๋กœ ๋ถ„ํ•ดํ•œ๋‹ค.

๊ทธ๋ฆฌ๊ณ  ๋‘ ๋ถ€๋ถ„์˜ semantic similarity๋ฅผ ๋”ฐ๋กœ ๊ณ„์‚ฐํ•œ๋‹ค.

1
2
verb similarity   = ์˜ˆ์ธก verb์™€ ์ •๋‹ต verb๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋น„์Šทํ•œ๊ฐ€?
object similarity = ์˜ˆ์ธก object์™€ ์ •๋‹ต object๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋น„์Šทํ•œ๊ฐ€?

๋งˆ์ง€๋ง‰์œผ๋กœ ๋‘ ๊ฐ’์„ ํ‰๊ท ๋‚ด์„œ ํ•˜๋‚˜์˜ HOI similarity score๋ฅผ ๋งŒ๋“ ๋‹ค.

1
sim(pred, gt) = (verb_sim + object_sim) / 2

์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋” ์„ฌ์„ธํ•œ ํ‰๊ฐ€๊ฐ€ ๊ฐ€๋Šฅํ•ด์ง„๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด sit on couch์™€ sit on chair๋Š” verb๋Š” ๊ฐ™์ง€๋งŒ object๊ฐ€ ๋‹ค๋ฅด๋‹ค. ๋ฐ˜๋Œ€๋กœ sit on couch์™€ lean on couch๋Š” object๋Š” ๊ฐ™๊ณ  verb๊ฐ€ ๋น„์Šทํ•˜๋‹ค.

๊ธฐ์กด exact-match์—์„œ๋Š” ๋‘˜ ๋‹ค ๊ทธ๋ƒฅ ์˜ค๋‹ต์ด์ง€๋งŒ, SHOE์—์„œ๋Š” ๋‘ ๊ฒฝ์šฐ๋ฅผ ๋‹ค๋ฅด๊ฒŒ ํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ๋‹ค.


๐Ÿง  Semantic similarity๋Š” ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•˜๋‚˜?

SHOE๋Š” verb์™€ object์˜ ์˜๋ฏธ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด WordNet synset๊ณผ ์—ฌ๋Ÿฌ LLM์˜ ํŒ๋‹จ์„ ํ™œ์šฉํ•œ๋‹ค.

GitHub README ๊ธฐ์ค€์œผ๋กœ repository์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ annotation ํŒŒ์ผ๋“ค์ด ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.

  • hico_verbs_with_synsets.csv: HICO verb์™€ WordNet synset mapping
  • hico_objects_with_synsets.csv: HICO object์™€ WordNet synset mapping
  • verb_shoe_scores.csv: verb synset pair ๊ฐ„ semantic similarity
  • object_shoe_scores.csv: object synset pair ๊ฐ„ semantic similarity

์ฆ‰, ํ‰๊ฐ€ํ•  ๋•Œ๋งˆ๋‹ค LLM์„ ์ƒˆ๋กœ ํ˜ธ์ถœํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ๋ฏธ๋ฆฌ ๊ณ„์‚ฐ๋œ similarity table์„ ์‚ฌ์šฉํ•œ๋‹ค.

์ด ๋ฐฉ์‹์€ ์‹ค์šฉ์ ์œผ๋กœ ์ค‘์š”ํ•˜๋‹ค.

ํ‰๊ฐ€ metric์ด ๋งค๋ฒˆ ๋น„์‹ผ LLM inference๋ฅผ ์š”๊ตฌํ•˜๋ฉด ์žฌํ˜„์„ฑ๊ณผ ๋น„์šฉ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธธ ์ˆ˜ ์žˆ๋‹ค. SHOE๋Š” ๋ฏธ๋ฆฌ ๊ตฌ์ถ•๋œ score table์„ ํ™œ์šฉํ•˜์—ฌ, ๊ธฐ์กด benchmark ํ‰๊ฐ€์ฒ˜๋Ÿผ ๋น„๊ต์  ์•ˆ์ •์ ์œผ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“ ๋‹ค.

README์—์„œ๋Š” SHOE similarity table์ด HICO-DET์˜ 600๊ฐœ class๋ฅผ ๋„˜์–ด์„œ, 3,800๋งŒ ๊ฐœ ์ด์ƒ์˜ semantically related HOI label ์กฐํ•ฉ์œผ๋กœ ํ™•์žฅ๋œ๋‹ค๊ณ  ์„ค๋ช…ํ•œ๋‹ค.

์ฆ‰, ๊ธฐ์กด HICO-DET label space ์œ„์—์„œ ์‹œ์ž‘ํ•˜์ง€๋งŒ, open-vocabulary ์˜ˆ์ธก์„ ๋” ๋„“๊ฒŒ ๋ฐ›์•„๋“ค์ผ ์ˆ˜ ์žˆ๋Š” ํ‰๊ฐ€ ๊ณต๊ฐ„์„ ๋งŒ๋“  ๊ฒƒ์ด๋‹ค.


โš™๏ธ SHOE Matching์€ ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋‚˜?

SHOE๋Š” ๋‹จ์ˆœํžˆ text similarity๋งŒ ๋ณด๋Š” metric์ด ์•„๋‹ˆ๋‹ค.

HOI Detection ํ‰๊ฐ€์ด๊ธฐ ๋•Œ๋ฌธ์— localization๋„ ํ•จ๊ป˜ ๋ณธ๋‹ค.

์ „์ฒด ํ๋ฆ„์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

1. Bounding box๋ฅผ ๋จผ์ € ํ™•์ธํ•œ๋‹ค

์˜ˆ์ธกํ•œ human box์™€ object box๊ฐ€ ์ •๋‹ต box์™€ ์ถฉ๋ถ„ํžˆ ๊ฒน์ณ์•ผ ํ•œ๋‹ค.

README ๊ธฐ์ค€์œผ๋กœ ๊ธฐ๋ณธ threshold๋Š” min(IoU_human, IoU_object) >= 0.5๋‹ค.

์ฆ‰, ์‚ฌ๋žŒ์ด ์–ด๋”” ์žˆ๋Š”์ง€, ๊ฐ์ฒด๊ฐ€ ์–ด๋”” ์žˆ๋Š”์ง€ ์™„์ „ํžˆ ํ‹€๋ ธ๋‹ค๋ฉด semantic label์ด ๋น„์Šทํ•ด๋„ ์ข‹์€ ์ ์ˆ˜๋ฅผ ๋ฐ›์„ ์ˆ˜ ์—†๋‹ค.

2. Verb์™€ object๋ฅผ synset์œผ๋กœ ๋งคํ•‘ํ•œ๋‹ค

์˜ˆ์ธก label๊ณผ ์ •๋‹ต label์˜ verb/object๋ฅผ WordNet synset์œผ๋กœ ์—ฐ๊ฒฐํ•œ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด bike์™€ bicycle์ฒ˜๋Ÿผ ํ‘œ๋ฉด ๋‹จ์–ด๊ฐ€ ๋‹ฌ๋ผ๋„ ๊ฐ™์€ ๊ฐœ๋…์œผ๋กœ ์—ฐ๊ฒฐ๋  ์ˆ˜ ์žˆ๋‹ค.

3. Verb similarity์™€ object similarity๋ฅผ ๊ฐ€์ ธ์˜จ๋‹ค

๋ฏธ๋ฆฌ ๊ณ„์‚ฐ๋œ SHOE score table์—์„œ verb pair์™€ object pair์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ฐ€์ ธ์˜จ๋‹ค.

4. ํ•˜๋‚˜์˜ instance similarity๋ฅผ ๋งŒ๋“ ๋‹ค

๋‘ ๊ฐ’์„ ํ‰๊ท ๋‚ด์„œ HOI similarity๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

1
sim(pred, gt) = (verb_sim + object_sim) / 2

5. Soft TP / FP / FN์œผ๋กœ ๊ณ„์‚ฐํ•œ๋‹ค

๊ธฐ์กด ํ‰๊ฐ€๋Š” ๋งž์œผ๋ฉด TP 1๊ฐœ, ํ‹€๋ฆฌ๋ฉด FP 1๊ฐœ์ฒ˜๋Ÿผ ๋”ฑ ์ž˜๋ผ ๊ณ„์‚ฐํ•œ๋‹ค.

SHOE๋Š” matched prediction์ด ์ •๋‹ต๊ณผ 0.8๋งŒํผ ๋น„์Šทํ•˜๋ฉด,

1
2
TP += 0.8
FP += 0.2

์ฒ˜๋Ÿผ softํ•˜๊ฒŒ ๋ฐ˜์˜ํ•œ๋‹ค.

๋ฐ˜๋Œ€๋กœ match๋˜์ง€ ์•Š์€ prediction์€ full FP, match๋˜์ง€ ์•Š์€ ground truth๋Š” full FN์ด ๋œ๋‹ค.

์ด ๋ฐฉ์‹์ด SHOE์˜ ํ•ต์‹ฌ์ด๋‹ค.

์™„์ „ํžˆ ๋งž์ง€๋Š” ์•Š์•˜์ง€๋งŒ ์˜๋ฏธ์ ์œผ๋กœ ๊ฐ€๊นŒ์šด ์˜ˆ์ธก์€ ๋ถ€๋ถ„ ์ ์ˆ˜๋ฅผ ๋ฐ›๊ณ , ์™„์ „ํžˆ ์—‰๋šฑํ•œ ์˜ˆ์ธก์€ ๊ฑฐ์˜ ์ ์ˆ˜๋ฅผ ๋ฐ›์ง€ ๋ชปํ•œ๋‹ค.


๐Ÿ“ SHOE mAP์™€ SHOE mF1

SHOE๋Š” ๋‘ ๊ฐ€์ง€ ํ‰๊ฐ€ ๋ชจ๋“œ๋ฅผ ์ œ๊ณตํ•œ๋‹ค.

SHOE mAP

SHOE mAP๋Š” confidence score๊ฐ€ ์žˆ๋Š” ๋ชจ๋ธ์„ ์œ„ํ•œ ํ‰๊ฐ€๋‹ค.

์ผ๋ฐ˜์ ์ธ HOI detector์ฒ˜๋Ÿผ ๊ฐ prediction์— confidence score๊ฐ€ ์žˆ์œผ๋ฉด, score ์ˆœ์„œ๋Œ€๋กœ precision-recall curve๋ฅผ ๋งŒ๋“ค๊ณ  AP๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

๋‹ค๋งŒ ๊ธฐ์กด mAP์ฒ˜๋Ÿผ exact match๋ฅผ ์“ฐ๋Š” ๋Œ€์‹ , SHOE similarity ๊ธฐ๋ฐ˜ soft count๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

๊ทธ๋ž˜์„œ structured HOI detector ํ‰๊ฐ€์— ์ž˜ ๋งž๋Š”๋‹ค.

SHOE mF1

SHOE mF1์€ confidence score ์—†์ด๋„ ์“ธ ์ˆ˜ ์žˆ๋Š” ํ‰๊ฐ€๋‹ค.

VLM์ด๋‚˜ MLLM์ฒ˜๋Ÿผ ์ž์—ฐ์–ด ์˜ˆ์ธก์„ ๋‚ด๋Š” ๋ชจ๋ธ์€ ํ•ญ์ƒ ์‹ ๋ขฐ๋„ ์ ์ˆ˜๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ์ œ๊ณตํ•˜์ง€ ์•Š๋Š”๋‹ค.

์ด๋Ÿฐ ๊ฒฝ์šฐ SHOE mF1์ด ์œ ์šฉํ•˜๋‹ค.

๋ชจ๋“  prediction์„ ๋™์ผํ•˜๊ฒŒ ๋ณด๊ณ , semantic similarity ๊ธฐ๋ฐ˜์œผ๋กœ precision, recall, F1์„ ๊ณ„์‚ฐํ•œ๋‹ค.

GitHub README์—์„œ๋„ ๋‘ ๋ชจ๋“œ๋ฅผ ๋‹ค์Œ์ฒ˜๋Ÿผ ๊ตฌ๋ถ„ํ•œ๋‹ค.

  • SHOE mAP: confidence-ranked, score๊ฐ€ ์žˆ๋Š” ๋ชจ๋ธ์šฉ
  • SHOE mF1: confidence-free, score๊ฐ€ ์—†๋Š” open-vocabulary prediction์—๋„ ์‚ฌ์šฉ ๊ฐ€๋Šฅ

์ฆ‰, SHOE๋Š” ๊ธฐ์กด HOI detector์™€ open-ended generative model์„ ๋ชจ๋‘ ํ‰๊ฐ€ํ•˜๋ ค๋Š” metric์ด๋‹ค.


๐Ÿ“Š ์‚ฌ๋žŒ ํŒ๋‹จ๊ณผ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋งž๋‚˜?

SHOE๊ฐ€ ์ฃผ์žฅํ•˜๋Š” ํ•ต์‹ฌ ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” human judgment์™€์˜ ์ •๋ ฌ์ด๋‹ค.

๋…ผ๋ฌธ๊ณผ README์— ๋”ฐ๋ฅด๋ฉด SHOE๋Š” ํ‰๊ท  human rating๊ณผ 85.73% agreement๋ฅผ ๋‹ฌ์„ฑํ–ˆ๋‹ค.

ํฅ๋ฏธ๋กœ์šด ์ ์€ ์ด๊ฒƒ์ด human inter-annotator agreement์ธ 78.61%๋ณด๋‹ค๋„ ๋†’๊ณ , direct LLM scoring์ด๋‚˜ embedding-based baseline๋ณด๋‹ค๋„ ์ข‹์•˜๋‹ค๋Š” ์ ์ด๋‹ค.

์ด ๊ฒฐ๊ณผ๊ฐ€ ์˜๋ฏธํ•˜๋Š” ๋ฐ”๋Š” ํฌ๋‹ค.

ํ‰๊ฐ€ metric์€ ๊ฒฐ๊ตญ ์‚ฌ๋žŒ์ด ๋ณด๊ธฐ์— ํƒ€๋‹นํ•ด์•ผ ํ•œ๋‹ค.

์‚ฌ๋žŒ์ด ๋ณด๊ธฐ์—๋Š” lean on couch๊ฐ€ sit on couch์™€ ์–ด๋А ์ •๋„ ๋น„์Šทํ•œ๋ฐ, metric์ด 0์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•œ๋‹ค๋ฉด ๊ทธ metric์€ open-vocabulary ํ‰๊ฐ€์— ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค.

SHOE๋Š” ์—ฌ๋Ÿฌ LLM์˜ ํŒ๋‹จ์„ ํ‰๊ท ํ•˜๊ณ , verb/object๋ฅผ ๋ถ„๋ฆฌํ•˜๊ณ , HOI ๊ตฌ์กฐ์— ๋งž๊ฒŒ similarity๋ฅผ ์กฐํ•ฉํ•จ์œผ๋กœ์จ ์‚ฌ๋žŒ์˜ ์˜๋ฏธ ํŒ๋‹จ์— ๋” ๊ฐ€๊นŒ์šด ํ‰๊ฐ€๋ฅผ ๋งŒ๋“ค๋ ค๊ณ  ํ•œ๋‹ค.


๐Ÿ” GitHub ์ €์žฅ์†Œ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ

๊ณต๊ฐœ๋œ GitHub ์ €์žฅ์†Œ์—๋Š” SHOE ํ‰๊ฐ€๋ฅผ ์‹คํ–‰ํ•˜๊ธฐ ์œ„ํ•œ ์ฝ”๋“œ์™€ ์˜ˆ์‹œ prediction์ด ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.

์ฃผ์š” ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • annotations/: ground truth, synset mapping, pre-computed SHOE score table
  • predictions/: Qwen, LAIN ์˜ˆ์‹œ prediction CSV
  • f1_soft.py: SHOE mF1 evaluator
  • map_soft.py: SHOE mAP evaluator
  • shoef1.sh: SHOE mF1 quick-run script
  • shoemap.sh: SHOE mAP quick-run script

์ž…๋ ฅ prediction CSV๋Š” ๋Œ€๋žต ๋‹ค์Œ ์ •๋ณด๋ฅผ ํฌํ•จํ•ด์•ผ ํ•œ๋‹ค.

  • image filename
  • human bounding box
  • object bounding box
  • predicted verb
  • predicted object
  • verb synsets
  • object synsets
  • confidence score

confidence score๋Š” SHOE mAP์—์„œ๋Š” ํ•„์ˆ˜์ด๊ณ , SHOE mF1์—์„œ๋Š” filtering์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

README์—์„œ๋Š” open-vocabulary model์ด discrete confidence score๋ฅผ ๋‚ด์ง€ ์•Š๋Š” ๊ฒฝ์šฐ token probability๋ฅผ proxy๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ์„ค๋ช…ํ•œ๋‹ค.


๐Ÿงญ CrossHOI-Bench์™€๋Š” ๋ญ๊ฐ€ ๋‹ค๋ฅธ๊ฐ€?

์ตœ๊ทผ HOI ํ‰๊ฐ€ ์—ฐ๊ตฌ๋ฅผ ๋ณด๋ฉด CrossHOI-Bench์™€ SHOE๊ฐ€ ๋น„์Šทํ•œ ๋ฌธ์ œ์˜์‹์„ ๊ณต์œ ํ•œ๋‹ค.

๋‘˜ ๋‹ค ๊ธฐ์กด HOI ํ‰๊ฐ€๊ฐ€ ๋„ˆ๋ฌด rigidํ•˜๋‹ค๋Š” ์ ์„ ์ง€์ ํ•œ๋‹ค.

ํ•˜์ง€๋งŒ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋‹ค๋ฅด๋‹ค.

CrossHOI-Bench๋Š” benchmark format์„ ๋ฐ”๊พผ๋‹ค.

  • HOI๋ฅผ ๋ณต์ˆ˜ ์ •๋‹ต ๊ฐ๊ด€์‹ ๋ฌธ์ œ๋กœ ์žฌ๊ตฌ์„ฑ
  • curated negative๋ฅผ ์ œ๊ณต
  • VLM๊ณผ HOI ์ „์šฉ ๋ชจ๋ธ์„ ๊ฐ™์€ question format์œผ๋กœ ๋น„๊ต

๋ฐ˜๋ฉด SHOE๋Š” evaluation metric์„ ๋ฐ”๊พผ๋‹ค.

  • ๊ธฐ์กด prediction๊ณผ ground truth๋ฅผ ๊ทธ๋Œ€๋กœ ๋‘๋˜
  • exact label match ๋Œ€์‹  semantic similarity๋ฅผ ์‚ฌ์šฉ
  • mAP/mF1 ๊ณ„์‚ฐ์— soft score๋ฅผ ๋ฐ˜์˜

์ฆ‰, CrossHOI-Bench๊ฐ€ โ€œ์‹œํ—˜์ง€๋ฅผ ์ƒˆ๋กœ ๋งŒ๋“ค์žโ€์— ๊ฐ€๊น๋‹ค๋ฉด, SHOE๋Š” โ€œ์ฑ„์  ๋ฐฉ์‹์„ ๋” ์˜๋ฏธ์ ์œผ๋กœ ๋งŒ๋“ค์žโ€์— ๊ฐ€๊น๋‹ค.

๋‘˜์€ ๊ฒฝ์Ÿ ๊ด€๊ณ„๋ผ๊ธฐ๋ณด๋‹ค ์„œ๋กœ ๋ณด์™„์ ์ด๋‹ค.

Open-vocabulary HOI ์‹œ๋Œ€์—๋Š” ์ข‹์€ benchmark๋„ ํ•„์š”ํ•˜๊ณ , ์ข‹์€ semantic metric๋„ ํ•„์š”ํ•˜๋‹ค.


๐Ÿš€ ์ด ๋…ผ๋ฌธ์ด ์ค‘์š”ํ•œ ์ด์œ 

SHOE๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ ๋Š” ํ‰๊ฐ€๊ฐ€ ๋ชจ๋ธ ๊ฐœ๋ฐœ ๋ฐฉํ–ฅ์„ ๋ฐ”๊พธ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

metric์ด exact label match๋งŒ ๋ณด๊ฒŒ ๋˜๋ฉด, ๋ชจ๋ธ์€ ์‚ฌ๋žŒ์ด ๋ณด๊ธฐ์—๋Š” ๋งž๋Š” ํ‘œํ˜„์„ ํ•ด๋„ ์ ์ˆ˜๋ฅผ ๋ชป ๋ฐ›๋Š”๋‹ค.

๊ทธ๋Ÿฌ๋ฉด ์—ฐ๊ตฌ์ž๋Š” ๊ฒฐ๊ตญ benchmark label์— ๋”ฑ ๋งž๋Š” ๋‹ต์„ ๋‚ด๋„๋ก ๋ชจ๋ธ์„ ์กฐ์ •ํ•˜๊ฒŒ ๋œ๋‹ค.

ํ•˜์ง€๋งŒ open-vocabulary vision-language model์˜ ์žฅ์ ์€ label set ๋ฐ–์˜ ํ‘œํ˜„๋ ฅ์ด๋‹ค.

๋ชจ๋ธ์ด grasp cup, hold cup, pick up cup์ฒ˜๋Ÿผ ๋‹ค์–‘ํ•œ ํ‘œํ˜„์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด, ํ‰๊ฐ€ metric๋„ ๊ทธ ์ฐจ์ด๋ฅผ ์–ด๋А ์ •๋„ ์ดํ•ดํ•ด์•ผ ํ•œ๋‹ค.

SHOE๋Š” ์ด ๋ฌธ์ œ๋ฅผ ๋‹ค์Œ์ฒ˜๋Ÿผ ์ •๋ฆฌํ•œ๋‹ค.

  • HOI label์€ ๋‹จ์ˆœ class id๊ฐ€ ์•„๋‹ˆ๋ผ ์˜๋ฏธ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง„๋‹ค.
  • verb์™€ object์˜ ์œ ์‚ฌ๋„๋ฅผ ๋”ฐ๋กœ ๋ด์•ผ ํ•œ๋‹ค.
  • localization์ด ๋งž๋Š” ์˜ˆ์ธก์— ๋Œ€ํ•ด์„œ๋Š” semantic partial credit์„ ์ค„ ์ˆ˜ ์žˆ๋‹ค.
  • confidence๊ฐ€ ์žˆ๋Š” detector์™€ confidence๊ฐ€ ์—†๋Š” generative model ๋ชจ๋‘ ํ‰๊ฐ€ํ•ด์•ผ ํ•œ๋‹ค.

์ด ๊ด€์ ์€ ์•ž์œผ๋กœ ๋” ์ค‘์š”ํ•ด์งˆ ๊ฐ€๋Šฅ์„ฑ์ด ํฌ๋‹ค.

VLM์ด ๋ฐœ์ „ํ• ์ˆ˜๋ก ๋ชจ๋ธ ์ถœ๋ ฅ์€ ์ ์  ๋” ์ž์—ฐ์–ด์— ๊ฐ€๊นŒ์›Œ์งˆ ๊ฒƒ์ด๊ณ , ํ‰๊ฐ€๋„ ๋‹จ์ˆœ label matching์—์„œ semantic evaluation์œผ๋กœ ์ด๋™ํ•  ์ˆ˜๋ฐ–์— ์—†๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.


โš ๏ธ ๊ทธ๋ž˜๋„ ์กฐ์‹ฌํ•ด์„œ ๋ด์•ผ ํ•  ์ 

SHOE๊ฐ€ ๋งค์šฐ ํฅ๋ฏธ๋กœ์šด metric์ด์ง€๋งŒ, ๋ชจ๋“  ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค.

์ฒซ์งธ, semantic similarity๊ฐ€ ๋†’๋‹ค๊ณ  ํ•ด์„œ ํ•ญ์ƒ ์‹œ๊ฐ์ ์œผ๋กœ ๋งž๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด sit on couch์™€ lean on couch๊ฐ€ ์˜๋ฏธ์ ์œผ๋กœ ๊ฐ€๊นŒ์›Œ๋„, ํŠน์ • ์ด๋ฏธ์ง€์—์„œ๋Š” ๋‘˜ ์ค‘ ํ•˜๋‚˜๋งŒ ์ •ํ™•ํ•  ์ˆ˜ ์žˆ๋‹ค.

๋”ฐ๋ผ์„œ SHOE ์ ์ˆ˜๋Š” โ€œ์˜๋ฏธ์ ์œผ๋กœ ๊ฐ€๊นŒ์šด ์ •๋„โ€๋ฅผ ๋ฐ˜์˜ํ•˜๋Š” ๊ฒƒ์ด์ง€, ๋ชจ๋“  ์‹œ๊ฐ์  ์„ธ๋ถ€ ์ฐจ์ด๋ฅผ ์™„๋ฒฝํ•˜๊ฒŒ ํŒ์ •ํ•˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค.

๋‘˜์งธ, LLM ๊ธฐ๋ฐ˜ similarity table ์ž์ฒด์—๋„ bias๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค.

์—ฌ๋Ÿฌ LLM ํ‰๊ท ์„ ์“ฐ๊ณ  human judgment์™€ ๋น„๊ตํ–ˆ๋‹ค๋Š” ์ ์€ ๊ฐ•์ ์ด์ง€๋งŒ, ์–ธ์–ด์  ์œ ์‚ฌ๋„์™€ ์‹ค์ œ ์‹œ๊ฐ์  affordance๊ฐ€ ์–ธ์ œ๋‚˜ ์ผ์น˜ํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค.

์…‹์งธ, WordNet synset mapping์ด ํ•„์š”ํ•œ ๋งŒํผ, ์™„์ „ํžˆ ์ž์œ ๋กœ์šด ์ž์—ฐ์–ด ํ‘œํ˜„์„ ๋‹ค๋ฃจ๋ ค๋ฉด preprocessing ํ’ˆ์งˆ๋„ ์ค‘์š”ํ•˜๋‹ค.

์ฆ‰, SHOE๋Š” exact-match mAP๋ฅผ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ•๋ ฅํ•œ ํ›„๋ณด์ด์ง€๋งŒ, ํŠนํžˆ open-vocabulary ํ‰๊ฐ€์—์„œ๋Š” ๊ธฐ์กด metric๊ณผ ํ•จ๊ป˜ ๋ณด๋ฉด์„œ ํ•ด์„ํ•˜๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.


๐Ÿง  ๋‚˜์˜ ์ฝ”๋ฉ˜ํŠธ!

SHOE๋Š” โ€œํ‰๊ฐ€ metric๋„ ์ด์ œ ์–ธ์–ด๋ฅผ ์ดํ•ดํ•ด์•ผ ํ•œ๋‹คโ€๋Š” ํ๋ฆ„์„ ์ž˜ ๋ณด์—ฌ์ฃผ๋Š” ์—ฐ๊ตฌ๋ผ๊ณ  ๋А๊ปด์ง„๋‹ค.

๊ธฐ์กด computer vision ํ‰๊ฐ€์—์„œ๋Š” label์ด ๋งž๋ƒ ํ‹€๋ฆฌ๋ƒ๊ฐ€ ์ค‘์š”ํ–ˆ๋‹ค. cat์ด๋ฉด cat, dog์ด๋ฉด dog์ฒ˜๋Ÿผ class๊ฐ€ ๋ถ„๋ฆฌ๋˜์–ด ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

ํ•˜์ง€๋งŒ HOI๋Š” ํ›จ์”ฌ ๋” ์–ธ์–ด์ ์ด๋‹ค.

hold, grasp, carry, pick up์€ ์„œ๋กœ ๊ฒน์น˜๋Š” ์˜๋ฏธ๊ฐ€ ์žˆ๊ณ , sit on, lean on, rest on๋„ ์ด๋ฏธ์ง€์— ๋”ฐ๋ผ ๊ฒฝ๊ณ„๊ฐ€ ํ๋ฆด ์ˆ˜ ์žˆ๋‹ค.

ํŠนํžˆ VLM์ด ๋“ฑ์žฅํ•œ ์ดํ›„์—๋Š” ๋ชจ๋ธ์ด ์‚ฌ๋žŒ์ด ๋งํ•˜๋“ฏ ๋‹ต์„ ๋‚ด๊ธฐ ๋•Œ๋ฌธ์—, ํ‰๊ฐ€๋„ ์‚ฌ๋žŒ์ด ์ดํ•ดํ•˜๋“ฏ ์–ด๋А ์ •๋„์˜ ์˜๋ฏธ์  ์œ ์—ฐ์„ฑ์„ ๊ฐ€์ ธ์•ผ ํ•œ๋‹ค.

๊ทธ๋Ÿฐ ์ ์—์„œ SHOE๋Š” open-vocabulary HOI ์—ฐ๊ตฌ์— ๊ฝค ์‹ค์šฉ์ ์ธ ๋„๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์žˆ๋‹ค.

๋‚˜์—๊ฒŒ ๊ฐ€์žฅ ์ธ์ƒ์ ์ธ ๋ถ€๋ถ„์€ SHOE๊ฐ€ ๋‹จ์ˆœํžˆ โ€œLLM์—๊ฒŒ ์ ์ˆ˜ ๋งค๊ฒจ๋‹ฌ๋ผโ€๊ณ  ํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์ด๋‹ค.

๋Œ€์‹  HOI๋ฅผ verb์™€ object๋กœ ๋‚˜๋ˆ„๊ณ , synset ๊ธฐ๋ฐ˜ table์„ ๋งŒ๋“ค๊ณ , detection matching๊ณผ soft TP/FP/FN์„ ๊ฒฐํ•ฉํ•œ๋‹ค.

์ฆ‰, ๊ธฐ์กด detection metric์˜ ๊ตฌ์กฐ๋ฅผ ๋ฒ„๋ฆฌ์ง€ ์•Š์œผ๋ฉด์„œ, semantic similarity๋ฅผ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋ผ์›Œ ๋„ฃ๋Š”๋‹ค.

์•ž์œผ๋กœ HOI๋ฟ ์•„๋‹ˆ๋ผ scene graph generation, visual relationship detection, embodied AI action recognition ๊ฐ™์€ ๋ถ„์•ผ์—์„œ๋„ ๋น„์Šทํ•œ semantic evaluation์ด ๋” ์ค‘์š”ํ•ด์งˆ ๊ฒƒ ๊ฐ™๋‹ค.

๊ฒฐ๊ตญ open-vocabulary ๋ชจ๋ธ์„ ์ œ๋Œ€๋กœ ํ‰๊ฐ€ํ•˜๋ ค๋ฉด, ์ •๋‹ต ๋ฌธ์ž์—ด์„ ๋งžํ˜”๋Š”์ง€๊ฐ€ ์•„๋‹ˆ๋ผ ์‚ฌ๋žŒ์ด ๋ณด๊ธฐ์— ๊ฐ™์€ ์˜๋ฏธ์˜ ์‹œ๊ฐ์  ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ–ˆ๋Š”์ง€๋ฅผ ๋ฌผ์–ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

SHOE๋Š” ๊ทธ ๋ฐฉํ–ฅ์œผ๋กœ ๊ฐ€๋Š” ๊ฝค ๊น”๋”ํ•œ ํ•œ ๊ฑธ์Œ์ด๋‹ค.

This post is licensed under CC BY 4.0 by the author.