
👀 Visual Attention Sink & VAR Paper Study (ICLR 2025)


👀 SEE WHAT YOU ARE TOLD: Visual Attention Sink in LMMs

Paper: SEE WHAT YOU ARE TOLD: VISUAL ATTENTION SINK IN LARGE MULTIMODAL MODELS
Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang
Venue: ICLR 2025
Keywords: Visual Attention Sink, LMM, LLaVA, VAR (Visual Attention Redistribution)
One-line summary:
LMMs do not "see" images properly: they pour attention, meaninglessly, onto garbage-bin tokens inside the image (visual sinks).
Analyze those tokens, reclaim the wasted attention, and redistribute it to the image patches that actually matter,
and performance improves across a range of vision-language benchmarks without any additional training!


🧩 Problem definition: do LMMs really look at images properly?

Large Multimodal Models (LMMs) such as LLaVA, Qwen2-VL, and InternVL
operate with an image encoder + LLM decoder structure.

The channel through which text tokens read visual information is text–image attention.
In theory, something like

"bird" token → strong attention only on the patches that contain a bird

should hold. But when the paper visualizes the attention maps of LLaVA-1.5-7B:

  • "bird", "banana", "knife" ๋“ฑ ์–ด๋–ค ํ…์ŠคํŠธ ํ† ํฐ์ด๋“  ์ƒ๊ด€์—†์ด
  • ํ•ญ์ƒ ๊ฐ™์€ ๋ช‡ ๊ฐœ์˜ ์ด๋ฏธ์ง€ ํ† ํฐ(ํŒจ์น˜)์— ๋†’์€ attention
  • ์‹ฌ์ง€์–ด ๊ทธ ํŒจ์น˜๋“ค์€ ์งˆ๋ฌธ๊ณผ ์ „ํ˜€ ์ƒ๊ด€์—†๋Š” ๋ฐฐ๊ฒฝ ์˜์—ญ

์ฆ‰,

LMM๋„ ์–ธ์–ด ๋ชจ๋ธ์ฒ˜๋Ÿผ โ€œํŠน์ • ํ† ํฐ์— attention์„ ๋ฒ„๋ฆฌ๋Š”โ€ ํ˜„์ƒ์„ ๊ฐ€์ง„๋‹ค.
๋‹ค๋งŒ ์ด๋ฒˆ์—๋Š” ์–ธ์–ด๊ฐ€ ์•„๋‹ˆ๋ผ ์ด๋ฏธ์ง€ ํ† ํฐ ์ชฝ์— ์ƒ๊ธด sink๋ผ๋Š” ์ ์ด ํฌ์ธํŠธ!!

๋…ผ๋ฌธ์€ ์ด ํ˜„์ƒ์„ Visual Attention Sink๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.


๐Ÿ” ํ•ต์‹ฌ ๊ด€์ฐฐ 1: Visual Sink Token์˜ ์ •์ฒด

๋…ผ๋ฌธ์˜ ์ฒซ ๋ฒˆ์งธ ํ•ต์‹ฌ ๊ธฐ์—ฌ๋Š”

โ€œ์ด ์ด์ƒํ•œ ํ† ํฐ๋“ค์ด ๋„๋Œ€์ฒด ๋ญ๋ƒ?โ€ ๋ฅผ
hidden state level์—์„œ ํ•ด๋ถ€ํ•œ ๊ฒƒ.

1) Massive activation in sink dimensions

A phenomenon already known from language models:

  • for tokens with almost no semantic content (BOS, `.`, `\n`, ...),
  • only a few specific hidden dimensions take abnormally large values ("massive activation");
  • the set of these dimension indices is denoted $D_{sink}$.

์ €์ž๋“ค์€ LLaVA-1.5-7B, LLaVA-1.5-13B, Qwen2-VL-7B, InternVL2-8B ๋“ฑ ์—ฌ๋Ÿฌ LMM์—์„œ

  • ์ด๋ฏธ์ง€ ํ† ํฐ ์ค‘ ์ผ๋ถ€๊ฐ€ BOS์™€ ๋™์ผํ•œ sink dimension์—์„œ massive activation ์„ ๋ณด์ธ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌ.
  • ์ฆ‰, โ€œ์‹œ๊ฐ์  sink ํ† ํฐโ€๋„ ์–ธ์–ด sink ํ† ํฐ๊ณผ ๋˜‘๊ฐ™์€ ํŒจํ„ด์„ ๊ฐ€์ง„๋‹ค (๋…ผ๋ฌธ Fig.2, Fig.7).

So visual sink tokens are defined as follows (a minimal sketch follows below):

  • in the hidden state $x$,
  • tokens whose sink-dimension value $\phi(x)$ is at or above a threshold $\tau$
    → are classified as sink tokens (the paper uses $\tau = 20$)
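As a rough illustration, here is a minimal PyTorch sketch of this classification rule. The function name, the assumption that $\phi(x)$ takes the maximum absolute activation over $D_{sink}$, and the idea that $D_{sink}$ has been identified beforehand by profiling BOS-like tokens are mine, not the paper's exact implementation:

```python
import torch

def find_visual_sink_tokens(hidden_states: torch.Tensor,
                            sink_dims: list[int],
                            tau: float = 20.0) -> torch.Tensor:
    """Classify image tokens as visual sink tokens.

    hidden_states: (num_image_tokens, d_model) hidden states of the
        image tokens at a given layer.
    sink_dims: indices D_sink of the dimensions showing massive
        activation on BOS-like tokens (assumed found by prior profiling).
    tau: activation threshold; the paper uses tau = 20.
    """
    # phi(x): here taken as the largest |activation| over the sink dims
    phi = hidden_states[:, sink_dims].abs().amax(dim=-1)
    return phi >= tau  # boolean mask, True = visual sink token
```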

2) Tokens clustered in the background

Comparing their positions on segmentation datasets (PASCAL VOC, COCO) (Table 6):

  • over 90% of visual sink tokens sit in
    background regions rather than on objects;
  • tokens tied to meaningful objects are mostly non-sink.

→ A picture entirely parallel to the "background sink / register" phenomenon in ViTs.


🔧 Key observation 2: sink tokens do "almost nothing"

To check "the attention is high, but is it actually useful?", the paper runs two experiments.

1) Token-masking experiment (attention knockout)

  • completely block the attention flowing from visual sink tokens into text tokens
    (i.e., zero out the α entries from text queries to those tokens)
  • control: mask the same number of randomly chosen image tokens

Results (Fig. 3(b)):

  • masking sink tokens → performance barely changes
  • masking random tokens → performance drops sharply

→ Sink tokens receive plenty of attention, but their informational contribution is nearly zero. (Sketch below.)
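A sketch of what such a knockout could look like on the post-softmax attention weights. Whether and how to renormalize after zeroing is my choice here; the paper may implement the intervention differently (e.g., before the softmax):

```python
import torch

def knock_out_sink_attention(attn: torch.Tensor,
                             text_idx: torch.Tensor,
                             sink_idx: torch.Tensor) -> torch.Tensor:
    """Block text -> visual-sink attention in one layer.

    attn: (num_heads, seq, seq) post-softmax weights; rows are queries.
    text_idx / sink_idx: 1-D index tensors of text / visual sink tokens.
    Returns a copy in which text queries ignore sink tokens entirely.
    """
    out = attn.clone()
    rows = out[:, text_idx]                       # (H, T, seq); advanced indexing copies
    rows[:, :, sink_idx] = 0.0                    # zero alpha(text query -> sink key)
    rows = rows / rows.sum(dim=-1, keepdim=True)  # renormalize each row to sum to 1
    out[:, text_idx] = rows
    return out
```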

2) Residual-stream contribution analysis

The contribution of each visual token to the residual stream is defined as

$$ \| \alpha_{i,j} \, x_j W_{OV} \| $$

and the sink vs. non-sink averages are compared (Fig. 3(c)):

  • the residual contribution of visual sink tokens is
    far smaller than that of the other tokens.
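For concreteness, this contribution measure could be computed per head roughly as follows (tensor names and shapes are my assumptions):

```python
import torch

def residual_contribution(attn: torch.Tensor,
                          values: torch.Tensor,
                          w_o: torch.Tensor) -> torch.Tensor:
    """||alpha_{i,j} * x_j W_OV|| for one attention head.

    attn:   (seq, seq) attention weights, rows = query tokens i.
    values: (seq, d_head) value vectors x_j W_V of this head.
    w_o:    (d_head, d_model) this head's slice of the output projection.
    Returns (seq, seq): contribution of token j to token i's residual.
    """
    ov = values @ w_o                      # x_j W_OV, (seq, d_model)
    contrib = attn[:, :, None] * ov[None]  # alpha_{i,j} * x_j W_OV
    return contrib.norm(dim=-1)
```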

In short,

Visual Attention Sink = "tokens that act as the image's buffer / garbage bin."
They carry no information, yet their attention and hidden activations are huge.


๐Ÿ’ก ์•„์ด๋””์–ด: ์“ธ๋ชจ์—†๋Š” Attention, ์˜ˆ์‚ฐ์œผ๋กœ ์žฌํ™œ์šฉํ•˜์ž

์—ฌ๊ธฐ๊นŒ์ง€์˜ ๊ด€์ฐฐ์„ ํ•œ ์ค„๋กœ ์ •๋ฆฌํ•˜๋ฉด:

โ€œ์‹œ๊ฐ์  sink ํ† ํฐ์— ์Ÿ์•„์ง€๋Š” attention์€ ๊ฑฐ์˜ ๋‚ญ๋น„๋‹ค.โ€

๋™์‹œ์—, ๋‹ค๋ฅธ ์—ฐ๊ตฌ๋“ค(Chen 2024, Liu 2024 ๋“ฑ)์—์„œ๋Š”

  • LMM์ด ํ…์ŠคํŠธ์— ๋น„ํ•ด ์ด๋ฏธ์ง€์— ๋„ˆ๋ฌด ์ ์€ attention์„ ์ฃผ๊ณ 
  • ๊ทธ ๊ฒฐ๊ณผ object hallucination, spatial reasoning ์‹คํŒจ ๋“ฑ์ด ๋ฐœ์ƒํ•œ๋‹ค๊ณ  ๋ณด๊ณ ํ–ˆ๋‹ค.

๊ทธ๋ž˜์„œ ๋…ผ๋ฌธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ƒ๊ฐํ•œ๋‹ค:

โ€œSink ํ† ํฐ์œผ๋กœ ํ˜๋Ÿฌ๊ฐ€๋Š” attention์„ ๊ฑท์–ด๋‹ค๊ฐ€
์ง„์งœ ์ด๋ฏธ์ง€ ์ •๋ณด(visual non-sink)์— ๋‹ค์‹œ ๋ฟŒ๋ฆฌ๋ฉด ์–ด๋–จ๊นŒ?โ€

๊ทธ๋ฆฌ๊ณ  ์ด ์•„์ด๋””์–ด๋ฅผ ๊ตฌ์ฒดํ™”ํ•œ ๊ฒƒ์ด ๋ฐ”๋กœ

VAR: Visual Attention Redistribution


🚀 Method: VAR (Visual Attention Redistribution)

VAR is a completely training-free method:

  1. select the heads that focus on the image (image-centric heads), and
  2. redistribute attention only within those heads.

Step 1: selecting image-centric heads

Tampering with every head could break the model outright,
so it is important to modify only the heads that are "already looking at the image properly."

The paper defines a metric called the Visual Non-Sink Ratio (Eq. 3):

$$ r_{i}^{\ell,h} = \frac{\sum_{j \in I_{vis} \setminus I_{q,vis}^{\ell}} \alpha_{i,j}^{\ell,h}}{\sum_{j \in I_{vis}} \alpha_{i,j}^{\ell,h}} $$

  • numerator: total attention going to visual non-sink tokens
  • denominator: total attention going to all visual tokens

์ด ๋น„์œจ์ด ๋†’์€ head์ผ์ˆ˜๋ก

โ€œsink garbage ๋ง๊ณ  ์‹ค์ œ ์ด๋ฏธ์ง€ ํŒจ์น˜์— ๋” ์ง‘์ค‘ํ•˜๋Š” headโ€

๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋…ผ๋ฌธ Fig.4๋ฅผ ๋ณด๋ฉด:

  • non-sink ratio๊ฐ€ ๋†’์€ head โ†’ ์งˆ๋ฌธ๊ณผ ๊ด€๋ จ๋œ ๊ฐ์ฒด ๋ถ€๋ถ„์— ์ง‘์ค‘
  • ratio๊ฐ€ ๋‚ฎ์€ head โ†’ ์ด๋ฏธ์ง€ ์ „๋ฐ˜์— ํ๋ฆฌ๊ฒŒ ๋ฟŒ๋ ค์ง„ attention

๊ทธ๋ž˜์„œ

  • ( r_{i}^{\ell,h} \ge \rho ) ์ธ head๋ฅผ
    Image-Centric Head (ICH) ๋กœ ์„ ํƒ (hyperparam ( \rho ))
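A minimal sketch of Eq. 3 and the selection step; the function name and the per-query thresholding detail are my reading, not the authors' code:

```python
import torch

def visual_non_sink_ratio(attn: torch.Tensor,
                          vis_idx: torch.Tensor,
                          sink_mask: torch.Tensor) -> torch.Tensor:
    """Eq. 3: per-head, per-query share of visual attention on non-sink tokens.

    attn: (num_heads, seq, seq) attention weights of one layer.
    vis_idx: 1-D indices of all visual tokens, I_vis.
    sink_mask: boolean over vis_idx, True where that visual token is a sink.
    Returns (num_heads, seq): r_i^{l,h} for every query i and head h.
    """
    attn_vis = attn[:, :, vis_idx]                     # (H, seq, |I_vis|)
    non_sink = attn_vis[:, :, ~sink_mask].sum(dim=-1)  # numerator of Eq. 3
    total = attn_vis.sum(dim=-1).clamp_min(1e-8)       # denominator of Eq. 3
    return non_sink / total

# Heads whose ratio for the current text query i reaches rho become ICHs:
# ich_mask = visual_non_sink_ratio(attn, vis_idx, sink_mask)[:, i] >= rho
```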

Step 2: redistributing attention from sink to non-sink

Only in the selected ICHs, for each text token $i$:

  1. Take a fraction $p$ of the attention going to sink tokens and
    collect it as an attention budget $\Omega$.

    • sink side: $\alpha_{i,j}^{q} = (1-p)\,\alpha_{i,j}$
    • budget: $\Omega = p \sum_{j \in I_{q,vis}} \alpha_{i,j}$
  2. Redistribute this budget over the visual non-sink tokens, in proportion to their existing attention (Eq. 4):

    $$ \alpha_{i,j}^{q} = \alpha_{i,j} + \Omega \cdot \frac{\alpha_{i,j}}{\sum_{k \in I_{vis} \setminus I_{q,vis}} \alpha_{i,k}} \quad (j \in I_{vis} \setminus I_{q,vis}) $$

  3. The total attention still sums to 1
    → the probability-distribution property is preserved (quick check below).
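Step 3 deserves a one-line check: the sink tokens give up exactly the budget that the non-sink tokens receive, so for each text token $i$,

$$
\sum_{j} \alpha_{i,j}^{q}
= (1-p)\sum_{j \in I_{q,vis}} \alpha_{i,j}
+ \Bigl( \sum_{j \in I_{vis} \setminus I_{q,vis}} \alpha_{i,j} + \Omega \Bigr)
+ \sum_{j \notin I_{vis}} \alpha_{i,j}
= \sum_{j} \alpha_{i,j} = 1,
$$

since $\Omega = p \sum_{j \in I_{q,vis}} \alpha_{i,j}$ cancels exactly the amount removed from the sink side.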

์ด ๊ณผ์ •์„ ๋ชจ๋“  ํ…์ŠคํŠธ ํ† ํฐ(์งˆ๋ฌธ, ์‹œ์Šคํ…œ ํ”„๋กฌํ”„ํŠธ, ์ƒ์„ฑ๋œ ๋‹ต๋ณ€ ํ† ํฐ)์— ์ ์šฉํ•˜๋˜,

  • ๋งˆ์ง€๋ง‰ ๋ ˆ์ด์–ด๋Š” ๊ฑด๋“œ๋ฆฌ์ง€ ์•Š๋Š”๋‹ค (๋งˆ์ง€๋ง‰ ๋ ˆ์ด์–ด๋Š” ํŠน์ˆ˜ํ•œ ์—ญํ• ์„ ๊ฐ€์ง„๋‹ค๋Š” ์„ ํ–‰์—ฐ๊ตฌ๋ฅผ ๋”ฐ๋ฆ„).
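Putting steps 1 and 2 together, a hedged PyTorch sketch of the per-head redistribution might look like this. Names and the treatment of query rows are mine; in the actual method this runs only in the selected ICHs, in every layer except the last:

```python
import torch

def var_redistribute(attn: torch.Tensor,
                     sink_idx: torch.Tensor,
                     nonsink_idx: torch.Tensor,
                     p: float) -> torch.Tensor:
    """One VAR step on one image-centric head (training-free).

    attn: (seq, seq) post-softmax attention weights; rows are queries.
    sink_idx:    indices of visual sink tokens, I_{q,vis}.
    nonsink_idx: indices of visual non-sink tokens (I_vis minus I_{q,vis}).
    p: fraction of sink attention reclaimed as the budget (hyperparameter).
    """
    out = attn.clone()
    # Step 1: collect the budget Omega and shrink the sink side by (1 - p)
    budget = p * out[:, sink_idx].sum(dim=-1, keepdim=True)  # Omega per query
    out[:, sink_idx] *= (1.0 - p)
    # Step 2 (Eq. 4): spread Omega over non-sink visual tokens, proportionally
    nonsink = out[:, nonsink_idx]
    share = nonsink / nonsink.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    out[:, nonsink_idx] = nonsink + budget * share
    return out  # each row still sums to 1
```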

📊 Experiments: "all it does is look at the image better, and yet…"

The paper attaches VAR to a variety of LMMs (LLaVA-1.5, VILA, Qwen2-VL, InternVL2, ...) and evaluates.

1) General vision-language benchmarks (Table 1)

Example: LLaVA-1.5-7B

  • VQAv2: 78.5 → 78.6
  • GQA: 62.0 → 63.5
  • VizWiz: 50.0 → 53.7
  • MM-Vet: 31.1 → 33.7

The interesting part:

LLaVA-1.5-7B + VAR beats
the base LLaVA-1.5-13B on some benchmarks
→ suggesting that "adjusting internal attention" alone can buy gains that would otherwise call for "a bigger model."

2) Hallucination benchmarks (Table 2)

On the CHAIR, POPE, and MMHal-Bench evaluations:

  • hallucination-related metrics ↓
  • accuracy/faithfulness-related metrics ↑

→ Quantitative evidence that "merely seeing the image better already reduces hallucination."

3) Vision-centric benchmarks (Table 3)

  • Consistent improvements also on spatial-understanding and 3D-relation benchmarks
    such as MMVP and CV-Bench 2D/3D.

🔬 Ablation: why touch "only the image tokens"?

The paper also runs a number of ablations:

  1. Apply VAR to all heads
    → the model breaks down entirely and scores 0 (Table 4, "w/o head selection")
    → head selection (modifying only the ICHs) is essential

  2. Redistribute the attention budget to:

    • both text + visual tokens
    • text tokens only
    • visual tokens only (the proposed method)

    Results (Table 5):

    • text-only: performance actually degrades
    • text + visual: a slight gain
    • visual-only (the proposed method): the largest and most stable gain

    → Meaning the LMM was already attending to the text plenty;
    what was genuinely lacking was attention on the image side.


๐Ÿง  ๋‚˜์˜ ์ฝ”๋ฉ˜ํŠธ!

์ด ๋…ผ๋ฌธ์„ StreamingLLM์˜ Attention Sink์™€ ์—ฐ๊ฒฐํ•ด์„œ ๋ณด๋ฉด ์ง„์งœ ์žฌ๋ฏธ์žˆ๋‹ค.

  • StreamingLLM:
    • โ€œ์ดˆ๊ธฐ ๋ช‡ ํ† ํฐ์ด ์‚ฌ์‹ค์ƒ attention sink/๋ ˆ์ง€์Šคํ„ฐ ์—ญํ• ์„ ํ•œ๋‹ค.
      โ†’ ๊ทธ ํ† ํฐ๋“ค๋งŒ ์œ ์ง€ํ•ด๋„ ๋ฌดํ•œ ์ŠคํŠธ๋ฆฌ๋ฐ ๊ฐ€๋Šฅโ€
  • ์ด๋ฒˆ Visual Attention Sink ๋…ผ๋ฌธ:
    • โ€œ์ด๋ฏธ์ง€์—์„œ๋„ ๋น„์˜๋ฏธ์ ์ธ sink ํŒจ์น˜๊ฐ€ ์กด์žฌํ•œ๋‹ค.
      โ†’ ์—ฌ๊ธฐ์— ์Ÿ์•„์ง€๋Š” attention์„ ํšŒ์ˆ˜ํ•˜๋ฉด ์ด๋ฏธ์ง€ ์ดํ•ด๊ฐ€ ์ข‹์•„์ง„๋‹ค.โ€

๋‘˜์˜ ๊ณตํ†ต์ :

๋ชจ๋ธ์€ ์–ธ์–ด๋“  ๋น„์ „์ด๋“  ๋‚ด๋ถ€ ๊ณ„์‚ฐ์„ ์œ„ํ•œ โ€˜์“ฐ๋ ˆ๊ธฐํ†ต/๋ ˆ์ง€์Šคํ„ฐ ๊ณต๊ฐ„โ€™์„ ์ž๋ฐœ์ ์œผ๋กœ ๋งŒ๋“ค์–ด ์“ด๋‹ค.
์ด๊ฒŒ ํ•™์Šต์˜ ๋ถ€์‚ฐ๋ฌผ์ฒ˜๋Ÿผ ์ƒ๊ฒผ๋Š”๋ฐ,
๋‚˜์ค‘์— ํ•ด์„ ๊ด€์ ์—์„œ ๋ณด๋ฉด ๊ฝค ์ผ๊ด€๋œ ๊ตฌ์กฐ์  ํŒจํ„ด์ด๋ผ๋Š” ์ ์ด ํฅ๋ฏธ๋กญ๋‹ค.

๋˜ ํ•˜๋‚˜์˜ ์ธ์‚ฌ์ดํŠธ:

  • ViT์˜ register token (Darcet 2023)
  • ์–ธ์–ด LLM์˜ attention sink token
  • LMM์˜ visual attention sink token

์ด ์„ธ ๊ฐ€์ง€๊ฐ€ โ€œ๋ชจ๋ธ ์•ˆ์—์„œ ์ •๋ณด๋ฅผ ์ €์žฅยท๊ณ ์ •ํ•˜๋Š” ์—ญํ• โ€์ด๋ผ๋Š”
ํ•˜๋‚˜์˜ ํฐ ํŒจํ„ด ์œ„์— ๋†“์—ฌ ์žˆ๋‹ค๋Š” ๋А๋‚Œ์„ ์ค€๋‹ค.

์•ž์œผ๋กœ๋Š” โ€œ์ด sink/๋ ˆ์ง€์Šคํ„ฐ ๊ณต๊ฐ„์„ ์–ด๋–ป๊ฒŒ ์„ค๊ณ„ยท์ œ์–ดํ•˜๋А๋ƒโ€๊ฐ€
๋‹จ์ˆœ ํšจ์œจ์„ ๋„˜์–ด์„œ ํ•ด์„ ๊ฐ€๋Šฅํ•œ ์ œ์–ด(steering) ์˜ ํ•ต์‹ฌ ์ถ•์ด ๋  ์ˆ˜๋„ ์žˆ์„ ๊ฒƒ ๊ฐ™๋‹ค.


✅ Wrap-up

  • LMMs tend to give excessive attention to
    meaningless patches inside the image (visual sinks).
  • Like the BOS sink in language models, these tokens show
    massive activation in specific hidden dimensions, yet contribute almost nothing to the actual prediction.
  • The paper treats this wasted attention as an attention budget and proposes VAR: redistributing it to visual non-sink tokens, in image-centric heads only.
  • Without any additional training, it achieves stable gains on
    • general VL benchmarks,
    • hallucination reduction, and
    • vision-centric tasks.

If there's appetite, follow-up posts could cover

  • spelling out the VAR math/algorithm at code level (PyTorch pseudo-code)
  • a review tying StreamingLLM's attention sink and this paper into one shared framework
  • comparing the ViT register token / LMM visual sink / LLM text sink
    from a single "hidden workspace" perspective

It would be fun to continue with these 🙂
