
🧩 AVA-Bench: Evaluating the Atomic Visual Abilities of Vision Foundation Models (CVPR 2026)

🧩 AVA-Bench: Key Paper Report

Paper: AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Authors: Zheda Mai, Arpita Chowdhury et al. (The Ohio State University, Adobe Research, Boston University)
Venue: CVPR 2026
Project Page: https://zheda-mai.github.io/AVA-Bench/
Key takeaway: A composite VQA score alone cannot reveal what a Vision Foundation Model (VFM) can really do!
Break the evaluation down into 14 Atomic Visual Abilities (AVAs) and read off each model's real strengths and weaknesses as an Ability Fingerprint!


🧩 Problem definition: I want to know exactly why the model got it wrong!!

A flood of Vision Foundation Models (VFMs) has come out recently, and a variety of VQA (Visual Question Answering) benchmarks are used to compare them.
But the existing benchmarks have some critical limitations...

Problem 1: Skill Confounding

When a model gets a VQA question right or wrong, it is hard to tell whether the cause is weak spatial perception, weak object recognition, or a compound reasoning failure! A single question usually calls on several abilities at once, so one score cannot separate them.
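To make the confounding concrete, here is a minimal sketch. The two models, their per-ability accuracies, and the independence assumption are all invented for illustration; none of it comes from the paper. It only shows how two models with opposite weaknesses can end up with the same score on a question type that needs both abilities.

```python
# Hypothetical per-ability accuracies, invented for illustration only.
model_a = {"localization": 0.9, "object_recognition": 0.6}
model_b = {"localization": 0.6, "object_recognition": 0.9}


def composite_accuracy(abilities: dict[str, float]) -> float:
    """Chance of answering a question that needs every listed ability.

    Simplifying assumption: the abilities succeed independently,
    so the individual accuracies multiply.
    """
    result = 1.0
    for accuracy in abilities.values():
        result *= accuracy
    return result


# Both models land on the same composite score despite opposite weaknesses.
print(round(composite_accuracy(model_a), 2))
print(round(composite_accuracy(model_b), 2))
```

Looking only at the composite number, the two models are indistinguishable, which is the gap a per-ability evaluation is meant to close.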

Problem 2: Data Mismatch

The distribution of the instruction data used to tune a VFM does not match the distribution of the evaluation data, so model comparisons can be biased or not meaningful at all.


🧠 Solution: split everything into 14 Atomic Visual Abilities (AVAs)!

The paper strips away the complex composite questions and defines 14 core "Atomic Visual Abilities" that make up a model's eyes (Vision), then diagnoses each one independently!

✔ 14 Atomic Visual Abilities (AVAs)

  • Geometric/Spatial: Localization, Spatial Reasoning, Absolute Depth, Relative Depth, Orientation
  • Perceptual/Recognition: Counting, Color, Object, Texture, Action, Emotion, Scene, OCR (text recognition), and more

The benchmark is built from about 218,000 image-question pairs curated from 26 existing datasets, so each ability gets a thorough evaluation!

์ด ๋ฒค์น˜๋งˆํฌ๋ฅผ ๋Œ๋ฆฌ๊ณ  ๋‚˜๋ฉด ๊ฐ VFM ๋ชจ๋ธ๋งˆ๋‹ค ๊ณ ์œ ์˜ ๊ฐ•์•ฝ์ ์„ ์‹œ๊ฐํ™”ํ•œ โ€˜Ability Fingerprintโ€™๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค!

  • DINOv2 ๊ฐ™์€ ์ž๊ธฐ์ง€๋„ํ•™์Šต(Self-Supervised) ๋ชจ๋ธ โžก๏ธ ๊นŠ์ด ์ธ์‹(Depth)์ด๋‚˜ ์œ„์น˜ ์ธ์‹(Geometric) ๊ฐ™์€ ๊ณต๊ฐ„ ์ •๋ณด ์ฒ˜๋ฆฌ์— ์—„์ฒญ ๋›ฐ์–ด๋‚จ!
  • SigLIP, AIMv2 ๊ฐ™์€ ์–ธ์–ด-์ด๋ฏธ์ง€ ๋Œ€์กฐํ•™์Šต ๋ชจ๋ธ โžก๏ธ ์นดํ…Œ๊ณ ๋ฆฌ ๋ถ„๋ฅ˜๋‚˜ ํ…์ŠคํŠธ ์ธ์‹(OCR) ๋“ฑ ์˜๋ฏธ๋ก ์ ์ธ ์ „๋ฐ˜์  ์ธ์ง€์— ์šฐ์ˆ˜ํ•จ!
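As a rough sketch of what a fingerprint could look like in code: the model names, ability names, and scores below are placeholders, and dividing each score by the best score on that ability is my own assumption, not the paper's procedure.

```python
# Hypothetical per-AVA scores in [0, 1]; placeholders, not results from the paper.
scores = {
    "model_x": {"relative_depth": 0.80, "localization": 0.70, "ocr": 0.30},
    "model_y": {"relative_depth": 0.60, "localization": 0.56, "ocr": 0.60},
}


def fingerprints(scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Scale every ability by the best score any model reaches on it."""
    abilities = list(next(iter(scores.values())))
    best = {a: max(model[a] for model in scores.values()) for a in abilities}
    return {
        name: {a: round(model[a] / best[a], 2) for a in abilities}
        for name, model in scores.items()
    }


fp = fingerprints(scores)
for name, profile in fp.items():
    print(name, profile)
```

A value of 1.0 means the model is the best of the group on that ability, so each profile shows at a glance where a model leads and where it trails.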

🚀 A low-cost, high-efficiency evaluation protocol

The compute needed to evaluate huge multimodal models is a big problem in its own right.
The paper shows that swapping the heavy 7B LLM used for evaluation for a lightweight 0.5B one keeps the evaluation reliable, with nearly the same model ranking!

  • Cuts evaluation compute by a full 8 times (8x)!
  • Fast, accurate benchmarking becomes possible even in resource-constrained settings!
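One way to check whether a small and a large evaluation LLM rank the models alike is a rank correlation. The sketch below uses Spearman's rho, which is my choice of metric and not necessarily the one in the paper; the VFM names and scores are made up.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models from best (1) to worst; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation for two score tables over the same models."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(rank_a)
    squared_diff = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * squared_diff / (n * (n ** 2 - 1))


# Hypothetical average scores per VFM under each evaluation LLM.
small_llm = {"vfm_a": 0.61, "vfm_b": 0.55, "vfm_c": 0.48, "vfm_d": 0.40}
large_llm = {"vfm_a": 0.72, "vfm_b": 0.69, "vfm_c": 0.50, "vfm_d": 0.58}

rho = spearman(small_llm, large_llm)
print(f"rank agreement (Spearman rho): {rho:.2f}")
```

A rho close to 1 means the cheaper setup orders the models almost the same way as the expensive one; in this toy data only the last two models swap places.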

🧠 My comments!

This felt like a great diagnostic toolkit: it goes beyond the simple leaderboard question of "which model is better" and helps decide "which VFM fits my service or project"!

That is because it gives a practical decision frame. When spatial perception matters, as in robotics or autonomous driving, pick a model that is strong on Spatial/Depth (e.g. DINOv2-based); when identifying objects matters, as in e-commerce or content classification, pick a model that is strong on General Recognition (e.g. the SigLIP family).
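That decision frame could be sketched as a weighted lookup over fingerprints. Everything here is hypothetical: the model names, the per-ability scores, the weights for each use case, and the `pick_model` helper.

```python
# Hypothetical fingerprints (per-ability scores in [0, 1]), invented for illustration.
fingerprint = {
    "model_x": {"depth": 0.9, "spatial": 0.8, "object": 0.6, "ocr": 0.4},
    "model_y": {"depth": 0.6, "spatial": 0.6, "object": 0.9, "ocr": 0.8},
}


def pick_model(fingerprint: dict[str, dict[str, float]],
               weights: dict[str, float]) -> str:
    """Return the model with the highest weighted average over the weighted abilities."""
    total = sum(weights.values())

    def score(profile: dict[str, float]) -> float:
        return sum(profile[ability] * w for ability, w in weights.items()) / total

    return max(fingerprint, key=lambda name: score(fingerprint[name]))


# Hypothetical ability weights per use case.
robotics = {"depth": 2.0, "spatial": 2.0, "object": 1.0}
ecommerce = {"object": 2.0, "ocr": 1.0}

print(pick_model(fingerprint, robotics))   # model_x
print(pick_model(fingerprint, ecommerce))  # model_y
```

The weights are where the project-specific judgment goes; the fingerprint itself stays the same across use cases.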

Also, shrinking the evaluation model to 0.5B for an 8x efficiency gain makes the benchmark itself far more practical, so it looks useful not just in academia but in industry too. Very impressive!

This post is licensed under CC BY 4.0 by the author.