๐Ÿ“ LoRA: Low-Rank Fine-Tuning for Large Language Models - Understanding LORA- LORA ์•Œ์•„๋ณด๊ธฐ?!!

๐Ÿ“ LoRA: Low-Rank Fine-Tuning for Large Language Models - Understanding LORA- LORA ์•Œ์•„๋ณด๊ธฐ?!!

🧠 LoRA: Low-Rank Fine-Tuning for Large Language Models

🔍 Lightweight and Fast! A New Paradigm for Efficient Fine-Tuning


Paper: LoRA: Low-Rank Adaptation of Large Language Models
Conference: ICLR 2022 (Edward J. Hu et al., Microsoft Research)
Code: microsoft/LoRA
Comment: A groundbreaking method that makes fine-tuning large language models dramatically more efficient!


📌 Summary

Want to fine-tune a large LLM… but without the massive GPU cost?

Traditionally, fine-tuning meant retraining every parameter in the model, which for modern LLMs means updating billions of weights.

LoRA solves this by enabling effective fine-tuning while learning only a tiny fraction of the parameters, often achieving comparable or even better performance.

🎯 Core Idea:
👉 "Keep the original model weights frozen. Train only a small, lightweight module (low-rank matrices) added alongside."


🧠 Why LoRA?

📌 The Challenge of Full Fine-Tuning in LLMs

  • Modern LLMs like GPT-3 have hundreds of billions of parameters
  • Fine-tuning every parameter is:
    • 💾 Storage-heavy: Each task needs a full model copy
    • 🚀 Deployment-unfriendly: Task switching is slow and heavy
    • 💸 Expensive: Requires huge compute and memory

💡 Limitations of Previous Methods

  1. Adapter Layers
    • Inserts bottleneck networks into the Transformer blocks
    • ✅ Efficient in parameter count
    • ❌ But adds latency, especially problematic in online inference or sharded deployments
  2. Prompt/Prefix Tuning
    • Adds trainable tokens to the input sequence
    • ✅ Keeps the model architecture unchanged
    • ❌ Suffers from optimization instability and reduces the usable sequence length

🚀 Motivation Behind LoRA

LoRA is based on the observation that the weight updates learned during fine-tuning have a low intrinsic rank, i.e., they lie in a low-dimensional subspace.

Thus:

  • Instead of updating full weight matrices,
  • LoRA learns a low-rank update: ΔW = BA
  • Only matrices A and B are trainable; the base model is frozen

✅ Result:
Less memory, fewer training FLOPs, and no added inference latency! (A quick parameter count follows.)
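
To make the savings concrete, here is a rough count of trainable values for a single d × d attention projection, assuming GPT-3 175B's hidden size d = 12288 and a LoRA rank of r = 4 (illustrative arithmetic, not a table from the paper):

```python
# Trainable values needed to adapt one d x d weight matrix.
d, r = 12288, 4                  # GPT-3 175B hidden size, a typical LoRA rank
full_update = d * d              # dense delta-W: ~151M values
lora_update = d * r + r * d      # B (d x r) plus A (r x d): ~98K values
print(full_update, lora_update, full_update // lora_update)
# 150994944 98304 1536  -> roughly 1,500x fewer trainable values per adapted matrix
```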


🏗️ How LoRA Works

💡 Low-Rank Update Parameterization

W' = W + ΔW = W + BA

  • A ∈ ℝ^{r×d}: initialized with random Gaussian values
  • B ∈ ℝ^{d×r}: initialized to zero
  • r ≪ d → low-rank structure
  • The base weight W is frozen; only A and B are trained.

This setup dramatically reduces the number of trainable parameters (and the optimizer state that goes with them) while preserving speed and performance. A minimal sketch of such a layer follows.
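
Below is a minimal PyTorch sketch of this parameterization. It is not the official microsoft/LoRA code; the class name LoRALinear, the random base weight, and the hyperparameter values are illustrative, while the A/B initialization and the α/r scaling follow the paper:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen weight W0 plus a trainable low-rank update BA (illustrative sketch)."""

    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        # Pretrained weight W0: frozen (random here, for illustration only).
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # A: random Gaussian init; B: zeros, so BA = 0 and training starts at W0.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) / math.sqrt(r))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r  # the paper scales the update by alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling

layer = LoRALinear(d_in=768, d_out=768, r=4)
h = layer(torch.randn(2, 768))                                      # (batch, d_out)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(h.shape, trainable)                                           # torch.Size([2, 768]) 6144
```

Because B starts at zero, the adapted layer reproduces the pretrained output exactly at step 0, and only the 2 × d × r low-rank parameters ever receive gradients.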

🤔 But how small can r go?

Smaller r means less resource usage, but is it still effective?

This was explored in Section 7.2: What is the Optimal Rank r for LoRA?

✅ Result: Even r = 1 yields surprisingly strong performance!
✅ LoRA with r = 8 and r = 64 was also compared using subspace similarity, and their top singular directions overlapped heavily!

Subspace Overlap
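
For reference, the normalized subspace similarity behind that comparison can be sketched as follows (an approximation of the Section 7.2 measure; the random tensors merely stand in for learned A matrices from real checkpoints):

```python
import torch

def subspace_similarity(A_lo: torch.Tensor, A_hi: torch.Tensor, i: int, j: int) -> float:
    """phi(A_lo, A_hi, i, j) = ||V_i^T V_j||_F^2 / min(i, j), a value in [0, 1].

    V_i / V_j are the top-i / top-j right-singular vectors of the two A matrices
    (each of shape (r, d)); 1 means the spanned subspaces overlap completely."""
    _, _, Vh_lo = torch.linalg.svd(A_lo, full_matrices=False)  # rows = right singular vectors
    _, _, Vh_hi = torch.linalg.svd(A_hi, full_matrices=False)
    overlap = Vh_lo[:i] @ Vh_hi[:j].T                          # (i, j) matrix of inner products
    return (overlap.norm() ** 2 / min(i, j)).item()

# Random stand-ins for learned A_{r=8} and A_{r=64} (real checkpoints would go here).
print(subspace_similarity(torch.randn(8, 1024), torch.randn(64, 1024), i=8, j=8))
```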


▶️ Forward Pass Modification

  • Original: h = W₀ × x
  • With LoRA: h = W₀ × x + BA × x

→ The two outputs are added element-wise (same dimensions).
→ This allows LoRA to introduce updates without altering the architecture.
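
Because BA has the same shape as W₀, the extra branch can also be folded into the base weight for deployment. A quick numerical check with arbitrary random tensors (illustrative only):

```python
import torch

d, r = 16, 2
W0 = torch.randn(d, d)      # frozen pretrained weight
A = torch.randn(r, d)       # trainable low-rank factor
B = torch.randn(d, r)       # trainable low-rank factor (zero at init in practice)
x = torch.randn(d)

h_unmerged = W0 @ x + B @ (A @ x)    # LoRA forward: two branches, summed element-wise
h_merged = (W0 + B @ A) @ x          # identical result once BA is merged into W0
print(torch.allclose(h_unmerged, h_merged, atol=1e-5))  # True
```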


🧠 How LoRA is Applied to Transformers

🔧 Target Weight Matrices

  • In Self-Attention Modules:
    • W_q: Query
    • W_k: Key
    • W_v: Value
    • W_o: Output
  • In MLP Modules: two Dense layers

In the experiments, W_q, W_k, and W_v are each treated as a single square matrix
(even though in practice the output is split across attention heads).

Most commonly, LoRA is applied to W_q and W_v; a usage sketch follows the figure below.
See Section 7.2 for ablations on rank selection and subspace behavior:

LoRA Rank Ablation
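
In practice, attaching LoRA to just the query/value projections is often done with the Hugging Face peft library rather than by hand. The snippet below is a usage sketch under that assumption: the library is not part of the paper, and module names such as q_proj/v_proj are model-dependent (here they match OPT-style checkpoints):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling numerator (alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach LoRA to W_q and W_v only
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% of the weights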


⚙️ Training Strategy

  • Only attention weight matrices are trained with LoRA
  • MLP, LayerNorm, and bias parameters are frozen

→ Simple and highly parameter-efficient; a sketch of the freezing step follows.
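
A minimal PyTorch sketch of this freezing step, assuming (as in the earlier sketch) that the low-rank factors carry "lora_" in their parameter names; the reference microsoft/LoRA code ships a similar mark_only_lora_as_trainable helper:

```python
import torch
import torch.nn as nn

def freeze_all_but_lora(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze every parameter except the low-rank factors and return an optimizer
    that only sees the remaining (trainable) parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name          # train only lora_A / lora_B
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```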


✅ Practical Benefits of LoRA

  • Memory Efficiency:
    • GPT-3 175B full fine-tuning: 1.2TB of VRAM
    • LoRA fine-tuning: 350GB
  • Checkpoint Size Reduction:
    • With r = 4 and only the Q/V projections adapted: 350GB → 35MB (~10,000× smaller); see the snippet after this list
  • Trainable on modest hardware
    • Avoids I/O bottlenecks
  • Low-cost Task Switching
    • Just swap LoRA modules instead of the entire model
  • 25% Faster Training
    • Most parameters are frozen, so gradients are computed only for the low-rank matrices
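
The tiny checkpoints come from saving only the low-rank factors instead of the whole model. A sketch in the spirit of the reference code's lora_state_dict helper, again assuming "lora_" appears in the parameter names:

```python
import torch
import torch.nn as nn

def lora_state_dict(model: nn.Module) -> dict:
    """Keep only the low-rank factors; the frozen base model is shared across tasks."""
    return {k: v for k, v in model.state_dict().items() if "lora_" in k}

# torch.save(lora_state_dict(model), "task_A_lora.pt")   # megabytes instead of gigabytes
```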

⚠️ Limitations

  • If you merge BA into W to avoid runtime overhead:
    • It's difficult to batch inputs from tasks that use different LoRA modules
  • However, when latency is not critical:
    • You can keep LoRA unmerged and dynamically swap modules per sample (sketched below)
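
At the weight level, that swap looks like the following sketch, with random tensors standing in for two tasks' learned factors:

```python
import torch

d, r = 16, 2
W0 = torch.randn(d, d)                            # shared frozen base weight
A1, B1 = torch.randn(r, d), torch.randn(d, r)     # task-1 LoRA factors
A2, B2 = torch.randn(r, d), torch.randn(d, r)     # task-2 LoRA factors

W = W0 + B1 @ A1              # merged for task 1: no extra latency at inference
W = W - B1 @ A1 + B2 @ A2     # switch tasks: subtract one update, add the other
print(torch.allclose(W, W0 + B2 @ A2, atol=1e-5))  # True
```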

🚀 LoRA in Empirical Evaluation

This work compares LoRA against several fine-tuning methods:

  • Full Fine-Tuning (FT)
    Trains all parameters. Standard method but memory-heavy.

  • BitFit (Bias-only Tuning)
    Trains only bias vectors: very light, but limited capacity.

  • Prefix Tuning (PreEmbed)
    Adds trainable tokens to the input; only their embeddings are trained.

  • Prefix Layer Tuning (PreLayer)
    Learns the prefix activations at each Transformer layer; more expressive than PreEmbed.

  • Adapter Tuning
    Adds small MLP "adapters" to each layer; multiple variants exist (AdapterH, AdapterL, etc.)

  • LoRA (Low-Rank Adaptation)
    Adds parallel low-rank matrices to the attention weights; maintains full inference speed
    while dramatically reducing memory use and the number of trainable parameters.


📊 Result?

LoRA achieves great performance while training far fewer parameters!

Performance Graph

  • On the GLUE benchmark (NLU), LoRA matches or outperforms full FT with RoBERTa/DeBERTa
  • On generation tasks with GPT-2 (E2E NLG) and GPT-3 (WikiSQL, SAMSum), LoRA outperforms prefix-based tuning
  • On GPT-3 175B, LoRA trains within 350GB of VRAM, versus roughly 1.2TB for full fine-tuning

🔮 Conclusion

LoRA is a breakthrough method for fine-tuning large Transformer models, from LLMs to ViTs and DETR.

It enables:

  • ⚡ Lightweight adaptation
  • 🧪 Rapid experimentation
  • 🌍 Efficient deployment
  • 🤖 Personalized AI at scale

This post is licensed under CC BY 4.0 by the author.