👁️ MLLMs Know Where to Look: Training-free Perception of Visual Details


🧠 TL;DR in 3 Lines

  1. MLLMs are generally good at knowing where to look,
    but often fail to understand what they're seeing.

  2. Simply cropping the relevant part of the image and feeding it back
    significantly improves detail-level recognition.

  3. If the image is too large, it is first split into smaller pieces and processed piecewise so the attention stays accurate.


⚠️ Problem Background

prob1

  • MLLMs often fail on questions about small objects in an image,
    but they succeed if we crop and provide only the relevant region.

📚 Datasets Used

The authors validate their method on the following 6 datasets:

| Dataset | Purpose | Image Type | Question Focus | External Knowledge | Example Models |
| --- | --- | --- | --- | --- | --- |
| DocVQA | Document-level question answering | Document images (PDFs) | Text extraction + layout understanding | ❌ | LayoutLM, Donut, DocFormer |
| TextVQA | Scene-text VQA | Natural images w/ text | Text in the context of a visual scene | ❌ | M4C, GRILL, LLaVA |
| POPE | Evaluating model bias and hallucination | Mixed image types | Robustness to misleading contexts | ❌ | BLIP2, Pythia |
| A-OKVQA | Knowledge-based multiple-choice VQA | Natural images | External knowledge + choice selection | ✅ | ReGAT, RAVQA, NoteMR |
| GQA | Relational reasoning and scene understanding | Complex scenes | Logic and spatial reasoning | ❌ | MAC, NS-VQA, GraftNet |
| VQAv2 | General-purpose VQA benchmark | Natural images | Objects, attributes, and general questions | ❌ | UpDn, Pythia, LXMERT |

🔧 Three Key Investigations

  1. Can the model solve these problems if a human simply crops the image for it?
    → Yes: manually cropping the relevant region significantly improves model performance!

  2. Do MLLMs fail because they don't know where to look, or because they can't understand even when looking at the right place?
    → It's the latter: they look in the right place but misinterpret what they see.

  3. Then what if we just show them the right region only?
    → That works very well!

0. Human cropping improves accuracy

crop_Effect

  • When humans crop only the relevant region of the image,
    MLLMs answer detail-based questions much more accurately.

🔍 1. Do MLLMs attend to the right place?

looking

  • Visualizing the attention maps inside the model's layers shows that
    it does look at the right area even when it gives a wrong answer.

✂️ 2. Just give the right region → better performance!

cropping

  • As seen above, cropping the relevant region and feeding it back in greatly boosts performance.
  • So, how do we crop effectively?
  • The authors propose three attention-based cropping strategies (a rough code sketch of Rel-Att follows the table):

| Method | Description |
| --- | --- |
| Rel-Att (Relative Attention) | Compares the attention map for the actual question against the map for a generic question, highlighting the question-specific regions |
| Grad-Att (Gradient-weighted Attention) | Weights attention by gradients of the model's answer confidence to find the regions the output is most sensitive to |
| Pure-Grad (Input Gradient) | Uses gradients with respect to the input image to locate visually salient pixels |
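
To make Rel-Att concrete, here is a minimal sketch under stated assumptions: `get_image_attention` is a hypothetical helper that pools the MLLM's text-to-image attention weights into an (h, w) patch grid, and the ratio-based comparison is only one reasonable way to contrast the two maps; the paper's exact formulation may differ.

```python
def relative_attention_map(model, image, question,
                           generic_question="Describe this image."):
    # Attention over image patches for the real question vs. a generic prompt.
    # get_image_attention is a hypothetical helper returning an (h, w) array.
    att_q = get_image_attention(model, image, question)
    att_g = get_image_attention(model, image, generic_question)

    # Regions specific to the question stand out where att_q is high
    # relative to the generic map (a difference would serve the same purpose).
    rel = att_q / (att_g + 1e-6)

    # Normalize to [0, 1] so a threshold or argmax can pick the crop region.
    rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-6)
    return rel
```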

Cropping pipeline (sketched in code after the list):

  • Input: image + question
  • Process: compute an attention map via one of the methods above → derive an ROI crop
  • Output: crop the image → feed it back to the MLLM → generate the answer
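
Putting the pieces together, here is a rough sketch of that pipeline, using the hypothetical `relative_attention_map` above and a placeholder `model.answer(image, question)` call standing in for the MLLM's actual generation API; the crop-size heuristic is an assumption, not the paper's exact rule.

```python
from PIL import Image

def vicrop_answer(model, image: Image.Image, question: str, box_frac: float = 0.4):
    # 1) Attention map over the patch grid (see the Rel-Att sketch above).
    att = relative_attention_map(model, image, question)
    h, w = att.shape

    # 2) Most-attended patch -> pixel coordinates of the ROI center.
    iy, ix = divmod(int(att.argmax()), w)
    W, H = image.size
    cx, cy = (ix + 0.5) * W / w, (iy + 0.5) * H / h

    # 3) Crop a window around the ROI (box_frac of each side, clipped to bounds).
    half_w, half_h = box_frac * W / 2, box_frac * H / 2
    box = (int(max(0, cx - half_w)), int(max(0, cy - half_h)),
           int(min(W, cx + half_w)), int(min(H, cy + half_h)))
    crop = image.crop(box)

    # 4) Re-ask the same question on the focused crop.
    return model.answer(crop, question)
```

Depending on the model, the crop can be passed alone or alongside the full image so that global context is not lost.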

The paper also compares cropping methods using external tools like YOLO, CLIP, and SAM:

Surprisingly, even against SOTA external methods, their proposed internal methods held up well.

crop_res

| Method | One-line Summary |
| --- | --- |
| CLIP ViCrop | Uses CLIP similarity to iteratively crop toward the most semantically aligned region |
| YOLO ViCrop | Selects the YOLO bounding box with the highest CLIP similarity to the question |
| SAM ViCrop | Converts SAM segmentation masks into bounding boxes, then selects the one with the best CLIP match |
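
For intuition, the shared idea behind the YOLO and SAM variants can be sketched as "score candidate boxes against the question with CLIP and keep the best one." The snippet below uses the Hugging Face `transformers` CLIP classes with an example checkpoint; the candidate boxes themselves are assumed to come from YOLO detections or SAM masks converted to boxes.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_crop_by_clip(image: Image.Image, question: str, boxes):
    """Pick the candidate box whose crop is most similar to the question.
    `boxes` is a list of (x0, y0, x1, y1) tuples, e.g. YOLO detections or
    bounding boxes derived from SAM masks."""
    crops = [image.crop(tuple(int(v) for v in box)) for box in boxes]
    inputs = processor(text=[question], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_text  # shape: (1, num_crops)
    best = int(sims.argmax())
    return crops[best], boxes[best]
```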

🧪 Experiment Results

  • The system performs cropping at inference time only; no retraining is required
  • Large images are pre-cropped to better guide attention (see the sketch after this list)
  • Evaluation covers multiple datasets and question types
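
How the pre-cropping of large images is done is a detail of the paper; as a rough, assumed illustration, one simple option is to tile the image, score each tile with the attention sketch from earlier, and keep the most attended tile before running the normal cropping pipeline.

```python
def most_attended_tile(model, image, question, grid=(2, 2)):
    # Split the image into a grid of tiles and keep the tile whose
    # relative-attention map peaks highest for this question.
    # (Illustrative assumption, not the paper's exact procedure.)
    W, H = image.size
    tw, th = W // grid[0], H // grid[1]
    best_tile, best_score = image, float("-inf")
    for gy in range(grid[1]):
        for gx in range(grid[0]):
            tile = image.crop((gx * tw, gy * th, (gx + 1) * tw, (gy + 1) * th))
            score = float(relative_attention_map(model, tile, question).max())
            if score > best_score:
                best_tile, best_score = tile, score
    return best_tile
```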

📈 Key Results

res1

  • Attention-based crops like Rel-Att and Grad-Att outperform the other approaches, especially for small-object questions.

res2

  • Cropping greatly helps when image resolution is high.

Summary of Effects:

| Setup | Performance Impact |
| --- | --- |
| Full image only | Poor on detail-based questions |
| Crop via attention-guided methods | Much higher accuracy |
| No retraining needed | Zero-shot, inference-time only |

Overall, this approach greatly improves fine-grained perception,
even without scaling up the model size.


✅ Conclusion & Impact

  • The paper shows MLLMs already know where to look,
    but need help seeing better via focused cropping.
  • Significant performance gains are possible without any retraining, using only attention-guided cropping at inference time.
  • Has strong applicability in domains like OCR, tiny-object detection, or interactive AI tutors.

“MLLMs know where to look. Let’s help them see better.”


👁️ (Korean) MLLMs Know Where to Look: Training-free Perception of Visual Details


🧠 TL;DR in 3 Lines

  1. MLLMs are good at figuring out where to look in an image,
    but they have trouble correctly recognizing what they are looking at.

  2. Cropping the important part of the image and feeding it back in
    lets the model recognize visual details far more accurately.

  3. If the image is too large, it is cut into pieces for accurate attention and then stitched back together!


⚠️ Background: Summary of the Existing Problem

prob1

  • The model gets questions about small objects in an image wrong, but if only the relevant region is cropped and shown, it answers correctly.

Note: Datasets Used

  • To validate the approach, the following six datasets were used!

| Dataset | Main Purpose | Image Type | Question Focus | External Knowledge | Example Models |
| --- | --- | --- | --- | --- | --- |
| DocVQA | Document-based QA (invoices, reports, etc.) | Document images (PDFs, etc.) | Text extraction + document-structure understanding | ❌ | LayoutLM, Donut, DocFormer |
| TextVQA | QA involving text inside a scene | Natural images + text | Understanding text in visual context | ❌ | M4C, GRILL, LLaVA |
| POPE | Evaluating bias and hallucination in VQA models | Mixed image types | Robustness to bias | ❌ | BLIP2, Pythia |
| A-OKVQA | External-knowledge VQA + quantitative evaluation | Natural images | Knowledge-based questions + multiple-choice answers | ✅ | ReGAT, RAVQA, NoteMR |
| GQA | Relational reasoning, semantic links between objects | Complex scene images | Scene understanding + relation-based QA | ❌ | MAC, NS-VQA, GraftNet |
| VQAv2 | General VQA benchmark with diverse question types | Natural images | General questions about objects, attributes, and scenes | ❌ | UpDn, Pythia, LXMERT |

🔧 Finding the Solution in 3 Steps

  1. If we crop just the small region and show it, does the model answer correctly?
    • A human cropped the regions to test this!
  2. Do MLLMs get it wrong because they don't even know where to look, or do they find the right spot but misread it?
    • The conclusion is the latter: they find the right spot but misinterpret it!

  3. Then, if we show only that region, will it work well?

  • Yes, it does!

0. If we crop just the small region and show it, does the model answer correctly?

crop_Effect

  • For questions about a small part of the image,
    when a human crops just the answer region and presents it, the model clearly answers much better!

🔍 1. Did the MLLM get it wrong because it doesn't know where to look, or did it find the right spot but misread it?

looking

  • Extracting and visualizing the attention from the MLLM's layers shows that,
    even though the answer was wrong, the model knows exactly where it should be looking!

✂️ 2. Then! If we present only that region, will it work well?

cropping

  • As confirmed in step 0, simply cropping the image and feeding it back in makes performance jump!
  • So, how do we crop?
  • Three attention-based cropping strategies:

| Method | Description |
| --- | --- |
| Rel-Att (Relative Attention) | Compares the attention map of the actual question vs. a generic question and emphasizes the difference to derive the crop region |
| Grad-Att (Gradient-weighted Attention) | Highlights sensitive regions via the gradient of the answer probability |
| Pure-Grad (Input Gradient) | Extracts important regions at the pixel level via the gradient with respect to the image itself |

  • How is the crop done?

    • Input: image + question
    • Process: compute an attention map with one of the three methods above → set the crop region
    • Output: feed the cropped image back into the MLLM to generate the answer
  • In addition, this work compares performance against cropping methods that use external tools such as YOLO, CLIP, and SAM!

Even compared to crops built on existing SOTA tools, the proposed methods held up well!

crop_res

| Method | One-line Summary |
| --- | --- |
| CLIP ViCrop | Uses CLIP to iteratively crop toward the region most semantically related to the question |
| YOLO ViCrop | Among the object regions detected by YOLO, selects the bounding box with the highest CLIP similarity to the question |
| SAM ViCrop | Converts SAM's segmentation masks into bounding boxes, then selects the region with the highest CLIP similarity |

🧪 Experiment Results!

  • The experiments perform attention-based cropping at inference time, with no training
  • Large images are pre-cropped so that attention is captured more reliably
  • For various question types, answers are generated after cropping and performance is compared

📈 Key Results

res1

  • Cropping with Rel-Att or Grad-Att gives the best results, especially for questions about small objects!

res2

  • For high-resolution images, cropping before processing worked especially well!

  • Performance summary:

| Condition | Performance |
| --- | --- |
| Full image input | Weak on small-detail questions |
| Attention-guided crop → re-input | Considerably higher accuracy on detail questions |
| No retraining | Zero-shot, inference-time only |

  • Across the experiments, performance clearly improves on tasks where small details matter
  • In particular, the improvement comes without needing a larger, higher-capacity model

✅ Conclusion & Significance

  • This paper shows that MLLMs know quite precisely "where to look,"
    but their "way of seeing" falls short, and it addresses this with attention-based cropping
  • Because performance can be improved at inference time alone, with no training,
    this is a very practical approach in terms of efficiency, applicability, and interpretability
  • It can be applied to various downstream tasks (e.g. OCR, fine-grained object recognition, tutoring systems)

“MLLMs know where to look. Let’s help them see better.”

This post is licensed under CC BY 4.0 by the author.