🧠 Notes-guided MLLM Reasoning

🧠 (English) Notes-guided MLLM Reasoning


🧠 3-Line Summary

  1. NoteMR refines external knowledge and image context to create Knowledge Notes.
  2. It identifies and extracts salient visual regions into Visual Notes to enhance perception.
  3. This approach improves KB-VQA performance on OK-VQA by 5.31%.

⚠️ Two Key Limitations in Existing KB-VQA Methods

0. First of all, what is KB-VQA?

KB-VQA stands for Knowledge-Based Visual Question Answering. It involves not only understanding the image and question, but also utilizing external knowledge to answer complex, open-ended visual questions.

📦 Representative KB-VQA Datasets

| Dataset | Description | Question Type | Evaluation |
| --- | --- | --- | --- |
| OK-VQA | Requires external knowledge for answers | Open-ended | VQA accuracy (soft matching against annotator answers) |
| A-OKVQA | OK-VQA extension with answer choices | Multiple choice | Accuracy |
| GQA | Focused on relational reasoning and scene understanding | Structured QA | Logical consistency, reasoning metrics |
| VCR | Visual Commonsense Reasoning | QA + Rationale | Choice + Explanation Accuracy |

📌 Difference between General VQA and KB-VQA

| Aspect | General VQA | KB-VQA |
| --- | --- | --- |
| Input | Image + Question | Image + Question + Knowledge |
| Example Q | “What is the cat doing?” | “What breed is this cat?” |
| Info Needed | Visual only | Visual + External knowledge |
| Models | BLIP, GIT | TRiG, RAVQA-V2, NoteMR |

1. External Knowledge Can Be Noisy

  • External knowledge retrieved from the web may be redundant or irrelevant, which can confuse the model and lead to incorrect answers.

Example:

Q: “What do they call running around the bases after hitting the ball?”

  • With retrieved info: answers “Stealing” (wrong) due to noisy text
  • Without retrieval: correctly answers “Home run”

error1

2. 👁️ Lacking Fine-Grained Visual Perception

  • MLLMs often fail to pick up on subtle visual cues, leading to hallucinations or visually irrelevant answers.

Example:

Despite a green light in the image, the model answers “Stop” due to poor visual focus.

error2


🔍 Method Summary

  • 🧠 Knowledge Note Generation: filters the retrieved external knowledge together with the image context to produce clean, relevant knowledge notes.

  • 👁️ Visual Note Generation: extracts the attended visual regions, guided by the knowledge notes, which reduces hallucinations and strengthens perception.

  • 📈 Achieves SOTA Performance

    • +5.31% on OK-VQA
    • +3.4% on A-OKVQA

🧪 Method Architecture

structure

1. Creating Textual Notes (N_kl)

  • Unlike past approaches that only extract knowledge, NoteMR combines external and internal knowledge to create notes.
  • External knowledge sources: Google Search Corpus + Wikidata

Top-k Selection:

  • Q: fused embedding of the question text and the visual features
  • D: embeddings of the candidate documents
  • Compute a relevance score between Q and each D, then keep the top-k passages (k = 5 in the experiments)
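
The selection step above can be sketched in a few lines. The post does not say which relevance function is used, so this sketch assumes a late-interaction (MaxSim) score of the kind used by ColBERT-style retrievers. The function names and the toy 2-d embeddings are hypothetical and only serve the illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Assumed relevance function: each query token keeps its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def top_k(query_tokens, docs, k: int = 5) -> List[Tuple[int, float]]:
    """Return (doc_index, score) for the k highest-scoring documents, best first."""
    scored = [(i, maxsim_score(query_tokens, doc)) for i, doc in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-d token embeddings; real Q and D would come from the retriever's encoders.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = [
    [[1.0, 0.0], [0.0, 1.0]],   # matches both query tokens
    [[1.0, 0.0]],               # matches only the first query token
    [[-1.0, -1.0]],             # matches neither
]
ranked = top_k(query, docs, k=2)
print(ranked)  # [(0, 2.0), (1, 1.0)]
```

With k = 5, as in the experiments, the five best-scoring passages would form P for the next step.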

N_kl Construction:

  • Prompt c_k, image V, top-k passages P
  • Text encoder: PreFLMR
  • Image encoder (at this stage): CLIP

2. Creating Visual Notes (N_vl)

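  • Extract the attended visual patches using GradCAM over cross-modal attention
  • Split the original image V into 576 patches of 16x16 pixels (a 24x24 grid for a square input)
  • Compute multi-head cross-modal attention scores between the N_kl tokens and the visual patches
  • Attention inputs for each head:

    • Q = N_kl (knowledge note tokens)
    • K = key weight matrix applied to the image patches V
    • V = value weight matrix applied to the image patches V
  • Combine the heads into a heatmap H, then keep only the regions above the threshold λ = 0.6 to form a mask
  • Apply the mask to the original image V; the masked result is the final N_vl
  • Image encoder: BLIP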
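
A minimal NumPy sketch of the heatmap-and-threshold idea follows. It is not the paper's implementation: plain attention averaging stands in for the GradCAM weighting, heads and note tokens are combined by a simple mean, and the heatmap is min-max rescaled before thresholding. All three are assumptions, as are the random features and projection matrices. Only Q and K are needed to build the heatmap, so the value projection is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def visual_note_mask(note_tokens, patch_feats, w_q, w_k, lam=0.6):
    """
    note_tokens: (T, d) embeddings of the knowledge-note tokens (queries)
    patch_feats: (M, d) image patch features (keys)
    w_q, w_k:    (heads, d, d_head) per-head projection matrices
    Returns a heatmap in [0, 1] of shape (M,) and a boolean mask of shape (M,).
    """
    q = np.einsum("td,hde->hte", note_tokens, w_q)    # (heads, T, d_head)
    k = np.einsum("md,hde->hme", patch_feats, w_k)    # (heads, M, d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)                   # (heads, T, M)
    heat = attn.mean(axis=(0, 1))                     # assumed: mean over heads and tokens
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-12)  # rescale to [0, 1]
    return heat, heat > lam

# Toy run with random features: 576 patches, i.e. a 24x24 grid.
rng = np.random.default_rng(0)
T, M, d, heads, d_head = 12, 576, 32, 4, 8
note_tokens = rng.normal(size=(T, d))
patch_feats = rng.normal(size=(M, d))
w_q = rng.normal(size=(heads, d, d_head))
w_k = rng.normal(size=(heads, d, d_head))

heat, mask = visual_note_mask(note_tokens, patch_feats, w_q, w_k, lam=0.6)
visual_note = patch_feats * mask[:, None]   # zero out patches at or below the threshold
print(int(mask.sum()), "of", M, "patches kept")
```

Reshaping `mask` to (24, 24) gives the spatial layout of the kept regions.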

3. Final Answer Selection

  • Inputs: question q, image V, knowledge note N_kl, visual note N_vl
  • Format into final prompt (see below)

final_prompt

  • Generate c_0 candidate answers and choose the best (used 3 candidates in experiments)
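
The final stage can be sketched as below. Two things here are stand-ins rather than facts from the post: the prompt template is hypothetical (the real one appears only as an image), and the "choose the best" rule is replaced by a simple majority vote, since the post does not describe the selection criterion. `generate` is a placeholder for a call to the MLLM, which would also receive the image V and the visual note N_vl.

```python
from collections import Counter
from typing import Callable, List

def build_final_prompt(question: str, knowledge_note: str) -> str:
    """Hypothetical template; the original prompt is not given in text form."""
    return (
        "Knowledge note:\n" + knowledge_note + "\n\n"
        "Question: " + question + "\n"
        "Answer with a short phrase, using the image and the visual note."
    )

def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())

def select_answer(candidates: List[str]) -> str:
    """Assumed rule: majority vote over normalized candidates; ties go to the earliest."""
    counts = Counter(normalize(c) for c in candidates)
    best, _ = max(counts.items(), key=lambda kv: kv[1])
    return best

def answer(question: str, knowledge_note: str,
           generate: Callable[[str], str], n_candidates: int = 3) -> str:
    prompt = build_final_prompt(question, knowledge_note)
    candidates = [generate(prompt) for _ in range(n_candidates)]
    return select_answer(candidates)

# Fake generator standing in for the MLLM, returning three canned candidates.
replies = iter(["Home run.", "home run", "Stealing"])
result = answer(
    "What do they call running around the bases after hitting the ball?",
    "A batter who hits the ball out of play may circle all the bases.",
    generate=lambda prompt: next(replies),
)
print(result)  # home run
```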

🔮 Results

Did it perform well? Absolutely!

res

  • Outperforms all baselines on OK-VQA and A-OKVQA
  • With LLaVA-NeXT-8B as the backbone, it even beats methods from other work that use 13B models

Ablation (Table 3):

  • Each stage improves on the previous one:

    1. MLLM only
    2. + Retrieved Knowledge
    3. + Knowledge Notes
    4. + Visual Notes
    5. + Candidate Output Selection

✅ Conclusion

  • Introduces a modular, note-based architecture for MLLM reasoning
  • Transitions MLLM from naive answering to structured reasoning
  • High potential for use in RAG, AI tutors, and multi-hop QA systems

🧠 (Translated from Korean) Notes-guided MLLM Reasoning


🧠 3-Line Summary

  1. NoteMR refines external knowledge and the image into a Knowledge Note,

  2. extracts only the key regions of the image as a Visual Note to improve visual perception,

  3. and thereby raises KB-VQA performance by 5.31% on OK-VQA, making it a state-of-the-art method.


⚠️ Two Main Limitations of Existing Methods

0. First, what is KB-VQA?

KB-VQA is short for Knowledge-Based Visual Question Answering.
It is a high-level visual reasoning task that goes beyond understanding the image and the question:
the model must also draw on external knowledge to infer the answer.

📦 Representative KB-VQA Datasets

| Dataset | Description | Question Type | Evaluation |
| --- | --- | --- | --- |
| OK-VQA | Outside Knowledge VQA. Contains questions that are hard to answer without external knowledge | Open-ended | VQA accuracy (soft matching against annotator answers) |
| A-OKVQA | Extension of OK-VQA. Includes answer candidates, which allows quantitative evaluation | Multiple-choice | Answer selection accuracy |
| GQA | Evaluates complex relational reasoning and scene understanding | Structured questions/answers | Logical accuracy, reasoning pattern analysis |
| VCR | Visual Commonsense Reasoning. Requires commonsense reasoning about the situation | Question + rationale | Answer selection + rationale evaluation |

📌 Differences between General VQA and KB-VQA

| Aspect | General VQA | KB-VQA |
| --- | --- | --- |
| Input | Image + question | Image + question + external knowledge |
| Example question | “What is this cat doing?” | “What breed is this cat?” |
| Information needed | Visual information in the image | Image + background knowledge (e.g. knowledge of breeds) |
| Representative models | BLIP, GIT, etc. | TRiG, RAVQA-V2, NoteMR, etc. |
  • Existing KB-VQA approaches fall into two groups:
    1. Supplying information (Retrieval Method)
      1-1. Using fixed knowledge bases such as ConceptNet
      1-2. Retrieving information from open-world knowledge (Google or Wikipedia)
    2. Leveraging an LLM (Implicit Method)
    • Techniques such as answering from added captions, or having the model call up its own knowledge to answer (PICa, PromptCap, etc.)

1. In KB-VQA, external knowledge can become “noise”

  • MLLMs use external knowledge to generate answers,
    but when the retrieved knowledge is redundant or inaccurate,
    it can confuse the model and lead to wrong answers.

Example:

Question: “What do they call running around the bases after hitting the ball?”
When simply asked the question, the model wrongly answers “Stealing”.
When external knowledge is added, the retrieved knowledge causes confusion and the model again answers “Stealing”. However, as in the right-hand image, when the MLLM is told to think it through on its own, it correctly answers “Home run”.

error1

2. 👁️ Lack of Fine-Grained Visual Processing

  • The vision encoder of an MLLM does not capture the fine details of an image well.
  • This causes hallucination (made-up responses unrelated to the input).

Example:

Even though the image shows a green light, the model answers “Stop” →
a failure of fine-grained visual perception.

error2


🔍 Research Summary

  • 🧠 Knowledge Note generation
    Based on the retrieved external knowledge and the image, unnecessary or redundant information is removed
    and only the key knowledge related to the image is summarized.

  • 👁️ Visual Note generation
    Guided by the image and the knowledge note, the model is steered toward the important visual information,
    which strengthens accurate visual perception → mitigates hallucination

  • 📈 State-of-the-art performance

    • 5.31% improvement on the OK-VQA dataset
    • 3.4% improvement on the A-OKVQA dataset
      → The experiments demonstrate the effectiveness of NoteMR

🧪 Methodology

structure

  • The method has three stages:
    1. Create the text note
    2. Create the visual note based on the text note
    3. Feed both notes, the image, and the question to generate candidates, then pick the answer from among them

1. Creating the Text Note (N_kl)

  • Prior work only tried to extract knowledge, whether from external or internal sources,
  • whereas NoteMR focuses on combining external and internal knowledge to generate notes.
    • External knowledge is used to help the MLLM draw out its internal knowledge, creating a synergy between the two.
    • As in prior work, the external knowledge comes from the Google Search Corpus and Wikidata (for OK-VQA and A-OKVQA, where external knowledge is essential).
  • Selecting material for the text note: a top-k method
    • Building Q (query embedding): the question's text embedding is combined with image features that have been vectorized and aligned to the text embedding space.
    • Building D (document embedding): documents such as Wikidata entries are text-embedded.
    • A relevance score is computed between Q and D, and the k highest-scoring documents are selected (Top-k = 5 in the experiments).
  • Generating the knowledge note (N_kl): the external knowledge (P) is used so that the MLLM can make the most of its internal knowledge.
    • This prevents noise caused by incorrect external knowledge.
    • Inputs for generating N_kl
      • c_k: a prepared prompt
      • V: the original image
      • P: the selected top-k passages
        ck1
  • The text encoder is PreFLMR and the image encoder is CLIP.

2. Creating the Visual Note (N_vl)

  • To select the important image patches, a cross-modal attention matrix is used (GradCAM).
  • To remove redundant information, only the attended regions of the image are kept, so the language model focuses on the areas related to the question and hallucination is reduced.
    • Concretely, the original image V is split into M patches and a feature is computed for each patch.
      • The patch size is 16x16, which gives 576 patches.
    • The knowledge note N_kl is tokenized, and multi-head cross-modal attention is computed between each patch and each token.
    • The multi-head transformer setup (with i heads):
      • Q: the knowledge note N_kl
      • K: key weight matrix × image patches V
      • V: value weight matrix × image patches V
    • All i heads are combined to obtain H.
    • Only the regions above the threshold λ are kept to create a mask (λ = 0.6).
    • The original image V is multiplied by the mask to produce the final visual note N_vl.
  • The image encoder is BLIP.
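
The attention and thresholding steps in this section can also be written compactly. This is a sketch under assumptions, not the paper's exact formulation: n (number of heads), T (number of note tokens), d (per-head dimension), and the projection matrices W_Q^{(i)}, W_K^{(i)} are notation introduced here. Averaging over heads and tokens and the min-max rescaling are assumed, since the post only says the heads are "combined", and the GradCAM gradient weighting is omitted.

```latex
\begin{aligned}
A^{(i)} &= \operatorname{softmax}\!\left(
  \frac{\bigl(N_{kl} W_Q^{(i)}\bigr)\bigl(V W_K^{(i)}\bigr)^{\top}}{\sqrt{d}}
\right) \in \mathbb{R}^{T \times M}, \qquad i = 1, \dots, n \\
\bar{H}_m &= \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} A^{(i)}_{t,m}, \qquad
H_m = \frac{\bar{H}_m - \min_j \bar{H}_j}{\max_j \bar{H}_j - \min_j \bar{H}_j} \\
\mathrm{Mask}_m &= \mathbb{1}\,[\,H_m > \lambda\,], \qquad \lambda = 0.6 \\
N_{vl} &= V \odot \mathrm{Mask}
\end{aligned}
```

Here m indexes the M = 576 patches, and the product in the last line zeroes out every patch whose heat value does not exceed λ.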

3. Final Answer Selection

  • What is ready at this point: the question q, the original image V, the knowledge note N_kl, and the visual note N_vl
  • These are combined into the prepared prompt:
    final_prompt
  • The model then generates c_o candidate answers and picks the best one among them.
  • As the experiments section shows, 3 candidates were used.

Experimental Results

So, were the scores good? Of course!
res

  • Best results on both OK-VQA and A-OKVQA
    • Top performance with LLaVA-NeXT-8B
    • Outperforms even the 13B models used in other work
  • Module by module (Ablation Study)
    • See Table 3 in the image, which is split into 5 stages:
    • Stage 1: the MLLM alone
    • Stage 2: add retrieved knowledge
    • Stage 3: add the knowledge as a knowledge note
    • Stage 4: knowledge note + visual note
    • Stage 5: run Stage 4 several times, then select
    • Every stage brings an improvement.

✅ Conclusion

  • The note-based structure breaks the model's reasoning process into stages,
  • allowing a move from a simple answer-only model to a thinking-based reasoning model.
  • High potential for application in RAG, AI tutors, and multi-hop question answering systems
This post is licensed under CC BY 4.0 by the author.