
🧐 Lost in the Middle - Do Language Models Really Remember Information in a Long Context?

🧠 Reading the Paper Lost in the Middle

🔍 LLMs struggle to remember information located in the middle of long documents!

Paper: Lost in the Middle: How Language Models Use Long Contexts
Venue: TACL 2023 (Liu, Nelson F., et al.)


โ“ Core Question from the Paper

โ€œCan language models utilize information equally regardless of its position in a long context?โ€

Short answer: No.

  • Most LLMs are least effective at recalling information located in the middle of long documents.
  • Even with large context windows, position bias still persists.

As shown below, the performance follows a U-shape curve:
Models perform best when the answer is at the beginning (primacy) or end (recency),
but significantly worse when it is in the middle.

[Figure: U-shaped performance curve by answer position]


🧪 Experiment: Needle-in-a-Haystack

Setup:

  • Insert a single key fact (a “needle”) into a long passage
  • Place it at the beginning / middle / end of the input
  • Ask the model to extract that specific information
Example:
Document length = 8,000 tokens
[... lengthy text ...]
→ Insert target sentence in the middle
→ Ask: "What was the number mentioned in the document above?"

👇 The target sentence is hidden in the middle of the input like this:

[Figure: sample prompt with the target sentence buried in the middle]
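Below is a minimal sketch of how such a test prompt could be assembled; the filler sentence, needle, question, and length budget are made-up placeholders for illustration, not the paper's actual data.

```python
# Sketch: build a needle-in-a-haystack style prompt with the "needle"
# placed at the beginning, middle, or end of a long filler passage.
# FILLER_SENTENCE, NEEDLE, and QUESTION are illustrative placeholders.

FILLER_SENTENCE = "The archive also contains a routine note about daily operations. "
NEEDLE = "The special magic number mentioned in this document is 42. "
QUESTION = "What was the number mentioned in the document above?"

def build_prompt(position: str, n_filler: int = 400) -> str:
    """position: 'beginning', 'middle', or 'end'."""
    sentences = [FILLER_SENTENCE] * n_filler
    insert_at = {"beginning": 0, "middle": n_filler // 2, "end": n_filler}[position]
    sentences.insert(insert_at, NEEDLE)   # hide the needle at the chosen spot
    document = "".join(sentences)
    return f"{document}\n\nQuestion: {QUESTION}\nAnswer:"

prompt = build_prompt("middle")           # needle buried in the middle
print(prompt[:120], "...")
```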


📉 Summary of Results (Figure 5)

When the answer is in the middle of the document, model accuracy drops significantly.
→ Most models, including GPT-3.5, show a U-shaped performance curve.

| Position | Recall (GPT-3.5) | Recall (Claude-1.3) |
| --- | --- | --- |
| Beginning | High | High |
| Middle ⚠️ | Lowest | Slightly lower |
| End | High | High |

[Figure: accuracy by answer position for each model]

๐Ÿ” GPT-4 also shows similar patterns in a subset of experiments,
but was excluded from full-scale experiments due to high cost (see Appendix D).


📌 Why does this happen? (§2.3, §3.2)

  • 🔒 Limitations of absolute positional encoding
  • 🔄 Self-attention’s inherent position bias
    → Stronger focus on early (primacy) and late (recency) positions
    → Middle positions receive less attention
  • 📏 The longer the input, the more performance degrades,
    with an over 20% drop in the 30-document setting (GPT-3.5)

🧠 Why does this matter?

Most real-world tasks like RAG, multi-document QA, and summarization rely on long input contexts.
But what if the model ignores the middle?

  • 📉 The position of retrieved documents directly impacts answer accuracy
  • 🔀 Effective chunking and ordering of key information is critical
  • ❗ Simply increasing the context window size is not enough

💡 Takeaways: Position Bias Matters

“LLMs can remember context, but mainly the beginning and the end.”

Strategies to mitigate position bias (a sketch of the first one follows this list):

  • ✅ Query-aware contextualization
    → Place the query before the documents for decoder-only models
  • ✅ Chunk ordering optimization
    → Put more relevant content earlier in the input
  • ✅ Improved attention architectures
    → Encoder-decoder models (e.g., T5, BART) perform better with long inputs
  • ✅ Position-free architectures
    → Hyena, RWKV, and other models aim to remove positional dependence
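As an illustration of the first strategy, here is a minimal sketch of query-aware contextualization for a decoder-only model; the prompt template and the repeated trailing question are assumptions for illustration, not the paper's exact wording.

```python
# Sketch: query-aware contextualization for a decoder-only model.
# The question is placed BEFORE the documents, so every document token is
# processed with the query already visible; repeating it at the end is a
# common extra reminder. The template wording is illustrative only.

def query_aware_prompt(question: str, documents: list[str]) -> str:
    doc_block = "\n\n".join(
        f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents)
    )
    return (
        f"Question: {question}\n\n"       # query first (query-aware contextualization)
        f"{doc_block}\n\n"
        f"Question: {question}\nAnswer:"  # optional repeat of the query at the end
    )

docs = ["Paris is the capital of France.", "The Nile is a river in Africa."]
print(query_aware_prompt("What is the capital of France?", docs))
```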

๐Ÿ” Retrieval-Based QA Setup (Section ยง2.2)

  • Task: Multi-document question answering
  • Retriever: Contriever (fine-tuned on MS-MARCO)
  • Reader input: Top-k retrieved documents + query
  • Number of docs (k): 10, 20, 30
  • Document type: Only paragraphs (no tables or lists)

📈 Impact of Increasing Retrieved Docs (Figure 5)

[Figure: reader accuracy as the number of retrieved documents grows]

  • ✅ k = 10 or 20 → improved accuracy
  • ⚠️ k = 30 → performance plateaus or drops
    • When the relevant document appears in the middle, accuracy drops
    • Some models even perform worse than the closed-book setting

โ— Retrieval Alone Is Not Enough

  • Even if retrieval includes the correct document,
    models may fail to use it effectively, especially if itโ€™s in the middle.

Retrieval โ‰  success
โ†’ Prompt design must account for position bias

Practical strategies (a minimal sketch follows the list):

  • ✅ Move relevant docs closer to the top
  • ✅ Use query-aware formatting
  • ✅ Minimize irrelevant context
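A minimal sketch of the first and third strategies, assuming each retrieved document already carries a retriever relevance score; the score threshold, document cap, and data structure are illustrative assumptions, not a prescription from the paper.

```python
# Sketch: put the most relevant retrieved documents first and drop weak
# matches, so key content sits where the model attends most reliably.
# RetrievedDoc, min_score, and max_docs are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    text: str
    score: float  # retriever relevance score (higher = more relevant)

def order_and_prune(docs: list[RetrievedDoc],
                    min_score: float = 0.3,
                    max_docs: int = 10) -> list[RetrievedDoc]:
    kept = [d for d in docs if d.score >= min_score]   # minimize irrelevant context
    kept.sort(key=lambda d: d.score, reverse=True)      # most relevant docs first
    return kept[:max_docs]

docs = [
    RetrievedDoc("barely related passage", 0.12),
    RetrievedDoc("gold passage containing the answer", 0.91),
    RetrievedDoc("loosely related passage", 0.48),
]
for d in order_and_prune(docs):
    print(f"{d.score:.2f}  {d.text}")
```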

✅ TL;DR

“LLMs remember long contexts, but often forget what’s in the middle.”


🧠 Reading the Lost in the Middle Paper (Korean version)

🔍 LLMs do not remember information located in the middle of long documents well!

Paper: Lost in the Middle: How Language Models Use Long Contexts
Venue: TACL 2023 (Liu, Nelson F., et al.)


โ“ ๋…ผ๋ฌธ์ด ๋˜์ง„ ํ•ต์‹ฌ ์งˆ๋ฌธ

โ€œ๊ธด context ์•ˆ์—์„œ, ๋ชจ๋ธ์€ **๋ชจ๋“  ์œ„์น˜์˜ ์ •๋ณด๋ฅผ ๊ท ๋“ฑํ•˜๊ฒŒ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์„๊นŒ?โ€

๊ฒฐ๋ก : No.

  • ๋Œ€๋ถ€๋ถ„์˜ LLM์€ ๊ธด ๋ฌธ์„œ์—์„œ ์ค‘๊ฐ„ ์ •๋ณด๋ฅผ ๊ฐ€์žฅ ์ž˜ ๋†“์นฉ๋‹ˆ๋‹ค.
  • context window๊ฐ€ ์•„๋ฌด๋ฆฌ ๊ธธ์–ด๋„ ์œ„์น˜ ํŽธํ–ฅ(position bias)์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

์•„๋ž˜ ์ด๋ฏธ์ง€์ฒ˜๋Ÿผ U-shape ์„ฑ๋Šฅ ๊ณก์„ ์ด ๋‚˜ํƒ€๋‚˜๋ฉฐ,
์•ž(primacy)๊ณผ ๋’ค(recency) ์ •๋ณด๋Š” ์ž˜ ๊ธฐ์–ตํ•˜์ง€๋งŒ,
์ค‘๊ฐ„ ์ •๋ณด๋Š” ๊ธฐ์–ต๋ ฅ์ด ๊ธ‰๋ฝํ•ฉ๋‹ˆ๋‹ค.

[Figure: U-shaped performance curve by answer position]


🧪 Experiment: Needle-in-a-Haystack

Setup:

  • Insert a single key fact (a “needle”) into a long document
  • Place it at the beginning / middle / end of the document and compare
  • Ask the model to extract that exact piece of information
Example:
Document length = 8,000 tokens
[... lengthy text ...]
→ Insert the “answer sentence” in the middle
→ Ask the model: "What was the number mentioned in the document above?"

👇 Performance is measured with a prompt like the one below, where the key information sits in the middle of the input:

[Figure: sample prompt with the key information buried in the middle]


📉 Summary of Results

When the information sits in the middle of the document, model accuracy drops sharply
→ GPT-3.5 and most other models show a U-shaped performance curve

| Position | Recall (GPT-3.5) | Recall (Claude-1.3) |
| --- | --- | --- |
| Beginning | High | High |
| Middle ⚠️ | Lowest | Slightly lower |
| End | High | High |

[Figure: accuracy by answer position for each model]

๐Ÿ” GPT-4๋„ ์ผ๋ถ€ ์‹คํ—˜์—์„œ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ ํŒจํ„ด์„ ๋ณด์˜€์œผ๋‚˜,
์ „์ฒด ์‹คํ—˜์—๋Š” ํฌํ•จ๋˜์ง€ ์•Š์•˜์œผ๋ฉฐ Appendix D์— ์ œํ•œ์ ์œผ๋กœ ๋ณด๊ณ ๋จ.
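To make the numbers above concrete, here is a small evaluation-loop sketch that sweeps the position of the gold fact and records accuracy per position; `ask_model`, the test-case format, and the loose string-match scoring are placeholders, not the paper's evaluation code.

```python
# Sketch: measure accuracy as a function of where the gold fact appears,
# the kind of sweep that produces the U-shaped curve discussed above.
# ask_model is a stand-in for a real LLM API call.

from collections import defaultdict

def ask_model(prompt: str) -> str:
    return ""  # placeholder: plug in an actual LLM call here

def accuracy_by_position(test_cases, positions=("beginning", "middle", "end")):
    """test_cases: iterable of (build_prompt_fn, gold_answer) pairs, where
    build_prompt_fn(position) returns the full prompt string."""
    correct, total = defaultdict(int), defaultdict(int)
    for build_prompt_fn, gold in test_cases:
        for pos in positions:
            answer = ask_model(build_prompt_fn(pos))
            correct[pos] += int(gold.lower() in answer.lower())  # loose string match
            total[pos] += 1
    return {pos: correct[pos] / max(total[pos], 1) for pos in positions}
```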


📌 Why Does This Happen? (Paper §2.3, §3.2)

  • 🔒 Structural limitations of absolute positional encoding
  • 🔄 Position bias in self-attention
    → attention concentrates on early (primacy) and late (recency) positions, while the middle is diluted
  • 📏 The longer the document, the larger the performance drop
    → over a 20% drop for GPT-3.5 in the 30-document setting

🧠 Why Does This Matter?

Most RAG (Retrieval-Augmented Generation), multi-document QA, and long-context summarization systems
rely on long contexts. But what if the model ignores the information in the middle?

  • 📉 The position of retrieved results directly affects QA accuracy
  • 🔀 Important documents and key facts are only used effectively when placed near the front
  • ❗ Simply enlarging the context window does not solve the problem ❌

💡 Takeaways: Strategies That Account for Position Bias

“LLMs remember context, but mainly the beginning and the end…”

Strategies to reduce position bias:

  • ✅ Query-aware contextualization
    → for decoder-only models, present the query before the documents
  • ✅ Chunk ordering optimization
    → move important information to the front and less important content to the back
  • ✅ Improved attention architectures
    → models with a bidirectional encoder (e.g., T5, BART) handle long inputs better
  • ✅ Position-free architectures
    → newer architectures such as Hyena and RWKV aim for position independence

๐Ÿ” Retrieval ๊ธฐ๋ฐ˜ ์‹คํ—˜ ๊ตฌ์„ฑ (๋…ผ๋ฌธ ยง2.2)

  • Task: Multi-document QA
  • Retriever: Contriever (MS-MARCO fine-tuned)
  • Reader ์ž…๋ ฅ: ๊ฒ€์ƒ‰๋œ k๊ฐœ์˜ ๋ฌธ์„œ + ์งˆ๋ฌธ
  • ๋ฌธ์„œ ์ˆ˜(k): 10, 20, 30๊ฐœ
  • ๋ฌธ์„œ ํ˜•ํƒœ: paragraph ๊ธฐ๋ฐ˜ (ํ‘œ, ๋ชฉ๋ก์€ ์ œ์™ธ)
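A minimal sketch of how the reader input could be assembled from the top-k retrieved passages plus the question; the instruction sentence and the `Document [i](Title: ...)` formatting are illustrative assumptions rather than the paper's exact prompt.

```python
# Sketch: assemble the reader input from the top-k retrieved passages and
# the question. The instruction wording and document formatting below are
# illustrative; passages are assumed to be {"title", "text"} dicts already
# ranked by the retriever.

def build_reader_input(question: str, passages: list[dict], k: int = 10) -> str:
    header = ("Write a high-quality answer for the given question "
              "using only the provided search results.\n\n")
    doc_lines = [
        f"Document [{i + 1}](Title: {p['title']}) {p['text']}"
        for i, p in enumerate(passages[:k])
    ]
    return header + "\n".join(doc_lines) + f"\n\nQuestion: {question}\nAnswer:"

passages = [
    {"title": "France", "text": "Paris is the capital and largest city of France."},
    {"title": "Nile", "text": "The Nile is a major river in northeastern Africa."},
]
print(build_reader_input("What is the capital of France?", passages, k=2))
```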

📈 More Retrieved Documents vs. Performance

[Figure: reader accuracy as the number of retrieved documents grows]

  • ✅ k = 10 or 20 → accuracy improves
  • ⚠️ k = 30 → performance plateaus or drops
    • Accuracy drops sharply when the gold document sits in the middle
    • Some models even fall below their closed-book performance

โ— Retrieval์€ ์ž˜ ๋˜์–ด๋„, ํ™œ์šฉ์€ ์–ด๋ ค์›€

  • LLM์€ ์ •๋‹ต ๋ฌธ์„œ๋ฅผ ๋ฐ›์•„๋„,
    ๊ทธ ์ •๋ณด๊ฐ€ ์ค‘๊ฐ„์— ์žˆ์œผ๋ฉด ์ž˜ ์‚ฌ์šฉํ•˜์ง€ ๋ชปํ•จ

Retrieval๋งŒ ์ž˜ํ•ด๋„ ์„ฑ๋Šฅ์ด ๋ณด์žฅ๋˜์ง€ ์•Š์Œ!
โ†’ LLM์˜ ์œ„์น˜ ํŽธํ–ฅ์„ ๊ณ ๋ คํ•œ prompt ๊ตฌ์กฐ ์„ค๊ณ„ ํ•„์ˆ˜

Example strategies:

  • ✅ Place the gold document near the front of the prompt
  • ✅ Use a query-aware prompt structure
  • ✅ Reduce the number of noise documents (tighter document selection)

✅ One-Line Summary

“LLMs remember long contexts, but they easily forget what is in the middle!”
