๐Ÿ“ LoRA: Low-Rank Fine-Tuning for Large Language Models - Understanding LORA- LORA ์•Œ์•„๋ณด๊ธฐ?!!

๐Ÿ“ LoRA: Low-Rank Fine-Tuning for Large Language Models - Understanding LORA- LORA ์•Œ์•„๋ณด๊ธฐ?!!

🧠 LoRA: Low-Rank Fine-Tuning for Large Language Models

🔍 Lightweight and Fast! A New Paradigm for Efficient Fine-Tuning


Paper: LoRA: Low-Rank Adaptation of Large Language Models
Conference: ICLR 2022 (Edward J. Hu et al., Microsoft Research)
Code: microsoft/LoRA
Comment: A groundbreaking method that makes fine-tuning large language models dramatically more efficient!


📌 Summary

Want to fine-tune a large LLM… but without the massive GPU cost?

Traditionally, fine-tuning meant retraining every parameter in the model, which for modern LLMs means updating billions of weights.

LoRA solves this by enabling effective fine-tuning while learning only a tiny fraction of the parameters, often achieving comparable or even better performance.

🎯 Core Idea:
👉 "Keep the original model weights frozen. Train only a small, lightweight module (low-rank matrices) added alongside."


🧠 Why LoRA?

📌 The Challenge of Full Fine-Tuning in LLMs

  • Modern LLMs like GPT-3 have hundreds of billions of parameters
  • Fine-tuning every parameter is:
    • 💾 Storage-heavy: Each task needs a full model copy
    • 🚀 Deployment-unfriendly: Task switching is slow and heavy
    • 💸 Expensive: Requires huge compute and memory

💡 Limitations of Previous Methods

  1. Adapter Layers
    • Inserts bottleneck networks into the Transformer blocks
    • ✅ Efficient in parameter count
    • ❌ But adds latency, especially problematic in online inference or sharded deployments
  2. Prompt/Prefix Tuning
    • Adds trainable tokens to the input sequence
    • ✅ Keeps the model architecture unchanged
    • ❌ Suffers from optimization instability and reduces the usable sequence length

🚀 Motivation Behind LoRA

LoRA is based on the observation that the weight updates learned during fine-tuning have a low intrinsic rank, i.e., they lie in a low-dimensional subspace.

Thus:

  • Instead of updating full weight matrices,
  • LoRA learns a low-rank update: ΔW = BA
  • Only matrices A and B are trainable; the base model is frozen

✅ Result:
Less memory, fewer training FLOPs, and no added inference latency! (A quick parameter count follows.)
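
To make the savings concrete, here is a rough count of trainable values for a single d × d attention projection, assuming GPT-3 175B's hidden size d = 12288 and a LoRA rank of r = 4 (illustrative arithmetic, not a table from the paper):

```python
# Trainable values needed to adapt one d x d weight matrix.
d, r = 12288, 4                  # GPT-3 175B hidden size, a typical LoRA rank
full_update = d * d              # dense delta-W: ~151M values
lora_update = d * r + r * d      # B (d x r) plus A (r x d): ~98K values
print(full_update, lora_update, full_update // lora_update)
# 150994944 98304 1536  -> roughly 1,500x fewer trainable values per adapted matrix
```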


🏗️ How LoRA Works

💡 Low-Rank Update Parameterization

W' = W + ΔW = W + BA

  • A ∈ ℝ^{r×d}: initialized with random Gaussian values
  • B ∈ ℝ^{d×r}: initialized to zero
  • r ≪ d → low-rank structure
  • The base weight W is frozen; only A and B are trained.

This setup dramatically reduces the number of trainable parameters (and the optimizer state that goes with them) while preserving speed and performance. A minimal sketch of such a layer follows.
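
Below is a minimal PyTorch sketch of this parameterization. It is not the official microsoft/LoRA code; the class name LoRALinear, the random base weight, and the hyperparameter values are illustrative, while the A/B initialization and the α/r scaling follow the paper:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen weight W0 plus a trainable low-rank update BA (illustrative sketch)."""

    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        # Pretrained weight W0: frozen (random here, for illustration only).
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # A: random Gaussian init; B: zeros, so BA = 0 and training starts at W0.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) / math.sqrt(r))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r  # the paper scales the update by alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling

layer = LoRALinear(d_in=768, d_out=768, r=4)
h = layer(torch.randn(2, 768))                                      # (batch, d_out)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(h.shape, trainable)                                           # torch.Size([2, 768]) 6144
```

Because B starts at zero, the adapted layer reproduces the pretrained output exactly at step 0, and only the 2 × d × r low-rank parameters ever receive gradients.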

🤔 But how small can r go?

Smaller r means less resource usage, but is it still effective?

This was explored in Section 7.2: What is the Optimal Rank r for LoRA?

✅ Result: Even r = 1 yields surprisingly strong performance!
✅ LoRA with r = 8 and r = 64 was also compared using subspace similarity, and their top singular directions overlapped heavily!

Subspace Overlap
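
For reference, the normalized subspace similarity behind that comparison can be sketched as follows (an approximation of the Section 7.2 measure; the random tensors merely stand in for learned A matrices from real checkpoints):

```python
import torch

def subspace_similarity(A_lo: torch.Tensor, A_hi: torch.Tensor, i: int, j: int) -> float:
    """phi(A_lo, A_hi, i, j) = ||V_i^T V_j||_F^2 / min(i, j), a value in [0, 1].

    V_i / V_j are the top-i / top-j right-singular vectors of the two A matrices
    (each of shape (r, d)); 1 means the spanned subspaces overlap completely."""
    _, _, Vh_lo = torch.linalg.svd(A_lo, full_matrices=False)  # rows = right singular vectors
    _, _, Vh_hi = torch.linalg.svd(A_hi, full_matrices=False)
    overlap = Vh_lo[:i] @ Vh_hi[:j].T                          # (i, j) matrix of inner products
    return (overlap.norm() ** 2 / min(i, j)).item()

# Random stand-ins for learned A_{r=8} and A_{r=64} (real checkpoints would go here).
print(subspace_similarity(torch.randn(8, 1024), torch.randn(64, 1024), i=8, j=8))
```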


▶️ Forward Pass Modification

  • Original: h = W₀ × x
  • With LoRA: h = W₀ × x + BA × x

→ The two outputs are added element-wise (same dimensions).
→ This allows LoRA to introduce updates without altering the architecture.
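
Because BA has the same shape as W₀, the extra branch can also be folded into the base weight for deployment. A quick numerical check with arbitrary random tensors (illustrative only):

```python
import torch

d, r = 16, 2
W0 = torch.randn(d, d)      # frozen pretrained weight
A = torch.randn(r, d)       # trainable low-rank factor
B = torch.randn(d, r)       # trainable low-rank factor (zero at init in practice)
x = torch.randn(d)

h_unmerged = W0 @ x + B @ (A @ x)    # LoRA forward: two branches, summed element-wise
h_merged = (W0 + B @ A) @ x          # identical result once BA is merged into W0
print(torch.allclose(h_unmerged, h_merged, atol=1e-5))  # True
```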


🧠 How LoRA is Applied to Transformers

🔧 Target Weight Matrices

  • In Self-Attention Modules:
    • W_q: Query
    • W_k: Key
    • W_v: Value
    • W_o: Output
  • In MLP Modules: two Dense layers

In the experiments, W_q, W_k, and W_v are each treated as a single square matrix
(even though in practice the output is split across attention heads).

Most commonly, LoRA is applied to W_q and W_v; a usage sketch follows the figure below.
See Section 7.2 for ablations on rank selection and subspace behavior:

LoRA Rank Ablation
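
In practice, attaching LoRA to just the query/value projections is often done with the Hugging Face peft library rather than by hand. The snippet below is a usage sketch under that assumption: the library is not part of the paper, and module names such as q_proj/v_proj are model-dependent (here they match OPT-style checkpoints):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling numerator (alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach LoRA to W_q and W_v only
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% of the weights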


⚙️ Training Strategy

  • Only attention weight matrices are trained with LoRA
  • MLP, LayerNorm, and bias parameters are frozen

→ Simple and highly parameter-efficient; a sketch of the freezing step follows.
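
A minimal PyTorch sketch of this freezing step, assuming (as in the earlier sketch) that the low-rank factors carry "lora_" in their parameter names; the reference microsoft/LoRA code ships a similar mark_only_lora_as_trainable helper:

```python
import torch
import torch.nn as nn

def freeze_all_but_lora(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze every parameter except the low-rank factors and return an optimizer
    that only sees the remaining (trainable) parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name          # train only lora_A / lora_B
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```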


✅ Practical Benefits of LoRA

  • Memory Efficiency:
    • GPT-3 175B full fine-tuning: 1.2TB of VRAM
    • LoRA fine-tuning: 350GB
  • Checkpoint Size Reduction:
    • With r = 4 and only the Q/V projections adapted: 350GB → 35MB (~10,000× smaller); see the snippet after this list
  • Trainable on modest hardware
    • Avoids I/O bottlenecks
  • Low-cost Task Switching
    • Just swap LoRA modules instead of the entire model
  • 25% Faster Training
    • Most parameters are frozen, so gradients are computed only for the low-rank matrices
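
The tiny checkpoints come from saving only the low-rank factors instead of the whole model. A sketch in the spirit of the reference code's lora_state_dict helper, again assuming "lora_" appears in the parameter names:

```python
import torch
import torch.nn as nn

def lora_state_dict(model: nn.Module) -> dict:
    """Keep only the low-rank factors; the frozen base model is shared across tasks."""
    return {k: v for k, v in model.state_dict().items() if "lora_" in k}

# torch.save(lora_state_dict(model), "task_A_lora.pt")   # megabytes instead of gigabytes
```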

⚠️ Limitations

  • If you merge BA into W to avoid runtime overhead:
    • It's difficult to batch inputs from tasks that use different LoRA modules
  • However, when latency is not critical:
    • You can keep LoRA unmerged and dynamically swap modules per sample (sketched below)
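
At the weight level, that swap looks like the following sketch, with random tensors standing in for two tasks' learned factors:

```python
import torch

d, r = 16, 2
W0 = torch.randn(d, d)                            # shared frozen base weight
A1, B1 = torch.randn(r, d), torch.randn(d, r)     # task-1 LoRA factors
A2, B2 = torch.randn(r, d), torch.randn(d, r)     # task-2 LoRA factors

W = W0 + B1 @ A1              # merged for task 1: no extra latency at inference
W = W - B1 @ A1 + B2 @ A2     # switch tasks: subtract one update, add the other
print(torch.allclose(W, W0 + B2 @ A2, atol=1e-5))  # True
```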

🚀 LoRA in Empirical Evaluation

This work compares LoRA against several fine-tuning methods:

  • Full Fine-Tuning (FT)
    Trains all parameters. Standard method but memory-heavy.

  • BitFit (Bias-only Tuning)
    Trains only bias vectors: very light, but limited capacity.

  • Prefix Tuning (PreEmbed)
    Adds trainable tokens to the input; only their embeddings are trained.

  • Prefix Layer Tuning (PreLayer)
    Learns the prefix activations at each Transformer layer; more expressive than PreEmbed.

  • Adapter Tuning
    Adds small MLP "adapters" to each layer; multiple variants exist (AdapterH, AdapterL, etc.)

  • LoRA (Low-Rank Adaptation)
    Adds parallel low-rank matrices to the attention weights; maintains full inference speed
    while dramatically reducing memory use and the number of trainable parameters.


📊 Result?

LoRA achieves great performance while training far fewer parameters!

Performance Graph

  • On the GLUE benchmark (NLU), LoRA matches or outperforms full FT with RoBERTa/DeBERTa
  • On generation tasks with GPT-2 (E2E NLG) and GPT-3 (WikiSQL, SAMSum), LoRA outperforms prefix-based tuning
  • On GPT-3 175B, LoRA trains within 350GB of VRAM, versus roughly 1.2TB for full fine-tuning

🔮 Conclusion

LoRA is a breakthrough method for fine-tuning large Transformer models, from LLMs to ViTs and DETR.

It enables:

  • ⚡ Lightweight adaptation
  • 🧪 Rapid experimentation
  • 🌍 Efficient deployment
  • 🤖 Personalized AI at scale

This post is licensed under CC BY 4.0 by the author.