Understanding CLIP-Adapter!
Easy Fine-Tuning for CLIP with Just One Adapter!
Paper: CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Published: IJCV 2024 (Gao, Peng, et al.)
Code: gaopengcuhk/CLIP-Adapter
Why CLIP-Adapter?
With the rise of large-scale vision-language models like CLIP,
our ability to understand images and text together has significantly improved.
One of the most exciting aspects is zero-shot classification: prediction without any labeled data!
However, these models also have some serious limitations.
Problem 1: Prompt Sensitivity
CLIP heavily depends on natural language prompts like "a photo of a {label}".
Changing just a few words (e.g., "this is a dog" vs. "a photo of a dog") can noticeably shift performance.
This forces manual prompt engineering, which becomes impractical in domain-specific tasks.
In short: CLIP is too sensitive to small changes in its prompts, as the sketch below illustrates.
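To make this concrete, here is a minimal sketch of zero-shot classification with two prompt wordings, assuming the openai/CLIP package; "example.jpg" and the class list are illustrative placeholders, not from the paper.

```python
# Minimal sketch: the same image scored with two prompt wordings.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

classes = ["dog", "cat", "car"]
templates = ["a photo of a {}", "this is a {}"]  # two wordings of the same idea

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    for template in templates:
        tokens = clip.tokenize([template.format(c) for c in classes]).to(device)
        text_feat = model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
        # The class probabilities can shift with the wording alone.
        print(template, probs.cpu().numpy())
```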
To address this, CoOp (Context Optimization) was introduced!
CoOp demonstrated that prompt tuning alone can fine-tune CLIP effectively!
- CoOp replaces natural language prompts with learnable continuous vectors.
- For example, instead of "this is a dog", the model is given [V1] [V2] [V3] dog.
- Here, [V1], [V2], [V3] are learned vectors, and the user only provides the class name (e.g., dog).
- No more manual prompt crafting: the model learns the prompt by itself (see the sketch after this list)!
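Here is a minimal PyTorch sketch of the idea of learnable context vectors; the dimensions, class-embedding placeholder, and the LearnablePrompt module are illustrative assumptions, not the actual CoOp code.

```python
# CoOp-style learnable context vectors prepended to class-name embeddings.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx=3, dim=512, n_classes=10):
        super().__init__()
        # [V1] ... [Vn]: shared learnable context vectors, one row per token
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Placeholder class-name embeddings (in CoOp these come from the
        # tokenized class names passed through CLIP's token embedding layer)
        self.register_buffer("class_embed", torch.randn(n_classes, 1, dim))

    def forward(self):
        # Prepend the learned context to every class-name embedding:
        # result is (n_classes, n_ctx + 1, dim), fed to the frozen text encoder
        ctx = self.ctx.unsqueeze(0).expand(self.class_embed.size(0), -1, -1)
        return torch.cat([ctx, self.class_embed], dim=1)

prompts = LearnablePrompt()
print(prompts().shape)  # torch.Size([10, 4, 512])
```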
Problem 2: Tuning Only the Text Side
But CoOp only tunes the prompt, that is, the text side of CLIP.
The image encoder remains fixed.
"We're adapting the language, but still trusting the same image representation?"
This imbalance limits performance, especially in few-shot or domain-specific scenarios.
In CoOp, only the [V1], [V2], ... tokens on the text side are learned.
CLIP-Adapter, in contrast, introduces adapters on both the image and text branches!
CLIP-Adapter Architecture!!!
CLIP-Adapter performs fine-tuning at the feature level for both image and text.
┌─────────────────────┐            ┌─────────────────────┐
│   Image Input (x)   │            │   Text Input (t)    │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│ CLIP Image Encoder  │            │  CLIP Text Encoder  │
│      (frozen)       │            │      (frozen)       │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│  Image Adapter MLP  │            │  Text Adapter MLP   │
│     (trainable)     │            │     (trainable)     │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│ Residual:           │            │ Residual:           │
│   image + adapted   │            │   text + adapted    │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│   Image Embedding   │            │   Text Embedding    │
└──────────┬──────────┘            └──────────┬──────────┘
           └────────────────┬─────────────────┘
                            │
           Cosine Similarity / Classification
Adapter MLPs (for image and text)
The adapter is a 2-layer MLP with a ReLU in between, also called a bottleneck MLP (sketched below):
- Structure: Linear → ReLU → Linear
- It reduces the feature dimension and then expands it back to the original size.
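A minimal PyTorch sketch of such a bottleneck adapter, assuming 1024-dimensional CLIP features (ResNet-50) and a 4x reduction; the exact dimensions are illustrative.

```python
# Bottleneck adapter: Linear -> ReLU -> Linear, squeezing then restoring the
# feature dimension. Dimensions are illustrative (1024-d CLIP RN50 features).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int = 1024, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)  # squeeze to the bottleneck
        self.relu = nn.ReLU(inplace=True)
        self.up = nn.Linear(dim // reduction, dim)    # expand back to the CLIP dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.relu(self.down(x)))
```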
Residual Connection
In few-shot learning, models tend to overfit due to limited data.
To solve this, CLIP-Adapter uses residual blending:
"Blend new knowledge (the adapter output) with the original CLIP features."
The final feature becomes:
α × Adapter Output + (1 - α) × CLIP Feature
This mixing helps retain the robustness of CLIP while injecting task-specific knowledge.
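A minimal sketch of this blending followed by cosine-similarity classification, reusing the Adapter class sketched above; α = 0.2, the batch size, and the 10-class setup are illustrative assumptions.

```python
# Residual blending of adapted and original CLIP features, then cosine-similarity
# classification against the class text embeddings. Shapes and values are illustrative.
import torch
import torch.nn.functional as F

alpha = 0.2
adapter = Adapter(dim=1024)                  # from the sketch above

image_feat = torch.randn(8, 1024)            # frozen CLIP image features (batch of 8)
text_feat = torch.randn(10, 1024)            # frozen CLIP text features (10 classes)

# Keep most of the original CLIP feature, inject a little task-specific knowledge.
adapted = alpha * adapter(image_feat) + (1 - alpha) * image_feat

logits = F.normalize(adapted, dim=-1) @ F.normalize(text_feat, dim=-1).T
pred = logits.argmax(dim=-1)                 # predicted class index per image
```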
Performance Experiments
CLIP-Adapter Experimental Setup
Datasets:
- ImageNet, StanfordCars, UCF101, Caltech101, Flowers102
- SUN397, DTD, EuroSAT, FGVCAircraft, OxfordPets, Food101
Settings:
- Few-shot setups: 1, 2, 4, 8, 16-shot
- Evaluation: average over 3 runs, single A100 GPU
Implementation:
- Visual adapter only; the text branch stays frozen (see the sketch after this list)
- Batch size: 32, learning rate: 1e-5
- α, β tuned via grid search (per dataset)
- Visual backbone: ResNet-50
- Text encoder: 12-layer Transformer
- Adapter hidden dim: 256 (1/4 of the original embedding dim)
- Prompt: fixed natural text ("a photo of a {class}")
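A minimal sketch of that setup under stated assumptions: CLIP loaded from the openai/CLIP package and fully frozen, only the visual adapter (from the sketch above) trained with Adam at lr 1e-5; the few-shot batches and the precomputed class text features are placeholders, not the authors' exact code.

```python
# Training sketch: frozen CLIP, trainable visual adapter, residual blending.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
for p in model.parameters():                 # freeze every CLIP weight
    p.requires_grad_(False)

adapter = Adapter(dim=1024).to(device)       # only these parameters are trained
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-5)
alpha = 0.2

def train_step(images, labels, text_features):
    """One step on a few-shot batch; text_features holds the frozen class prompt embeddings."""
    image_feat = model.encode_image(images.to(device)).float()
    adapted = alpha * adapter(image_feat) + (1 - alpha) * image_feat
    logits = 100.0 * F.normalize(adapted, dim=-1) @ F.normalize(text_features, dim=-1).T
    loss = F.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```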
CLIP-Adapter Results
Baselines:
- Zero-shot CLIP: frozen model + prompt only
- Linear Probe CLIP: frozen encoder + trainable linear classifier
- CoOp: learns [V1] [V2] ... tokens in the prompt
CLIP-Adapter outperforms all baselines in accuracy, training speed, and parameter efficiency, especially in few-shot learning.
Where to Put the Adapter?
- Visual adapter placement: image only, text only, or both → best: image only
- Insertion layer: ViT-B has 12 layers → best: insert the adapter after layer 12 (the last layer)
What about the Residual Ratio α?
- Fine-grained datasets (e.g., EuroSAT, DTD): best α ≈ 0.6-0.8
- Generic datasets (e.g., Caltech101, ImageNet): best α ≈ 0.2
A small sketch of picking α on a validation split follows below.
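A minimal sketch of such a grid search under stated assumptions; evaluate_fewshot is a hypothetical stand-in for "train the adapter with this α, then report validation accuracy", not a function from the paper's code.

```python
# Hypothetical grid search over the residual ratio alpha.
def evaluate_fewshot(alpha: float) -> float:
    # Placeholder: in practice, train the adapter with this alpha on the
    # few-shot split and measure accuracy on a held-out validation set.
    return 0.0

candidate_alphas = [0.2, 0.4, 0.6, 0.8]
best_alpha = max(candidate_alphas, key=evaluate_fewshot)
print("best alpha:", best_alpha)
```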
Final Thoughts
This was my second PEFT (Parameter-Efficient Fine-Tuning) method after studying LoRA,
and I found CLIP-Adapter both innovative and effective.
I used to think of an "adapter" as just a power plug.
Now, I'll always remember CLIP-Adapter!
(Korean version) Understanding CLIP-Adapter!
Easy fine-tuning for CLIP with just one adapter!!
Paper: CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Published: IJCV 2024 (Gao, Peng, et al.)
Code: gaopengcuhk/CLIP-Adapter
Background: Why Did CLIP-Adapter Appear!?
With the arrival of large-scale vision-language models like CLIP, our ability to understand images and text together improved dramatically.
Above all, zero-shot classification was revolutionary: inference became possible without any labels.
However, these models still had serious limitations.
Problem 1: Prompt Sensitivity
CLIP depends on prompts like "a photo of a {label}".
For example, "a photo of a dog" and "this is a dog" can produce different results.
So a person had to hand-design which prompt yields the best performance (prompt engineering)!!
And in specialized domains, even such prompt engineering was of little use!!
In other words, CLIP reacts sensitively even to a sentence where just one word has changed.
That is where CoOp (Context Optimization) came in!
The CoOp work showed that CLIP can be fine-tuned through prompt tuning alone!
- CoOp replaces the prompt sentence with learnable continuous vectors.
- For example, instead of this is a dog, we input [V1] [V2] [V3] dog!
- Here, [V1] [V2] [V3] are vectors learned during fine-tuning, so in the end the user only has to type dog!
- By removing the need for a person to design the prompt (prompt-free tuning), the model is made to learn the prompt itself.
Problem 2: The Limits of Tuning Only the Text Side
However, CoOp fine-tunes only the text prompt.
The image features are left untouched.
"We train the text, but the image representation stays frozen?"
This imbalance can lead to degraded performance, especially in specific domains or in few-shot learning.
That is CoOp's design: only the V1, V2, ... tokens in the text prompt are learned!!
Today's CLIP-Adapter is different: it has adapters for both the text and the image!?
CLIP-Adapter Architecture!!!
CLIP-Adapter performs adjustment directly at the feature level for both image and text.
┌─────────────────────┐            ┌─────────────────────┐
│   Image Input (x)   │            │   Text Input (t)    │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│ CLIP Image Encoder  │            │  CLIP Text Encoder  │
│      (frozen)       │            │      (frozen)       │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│  Image Adapter MLP  │            │  Text Adapter MLP   │
│     (trainable)     │            │     (trainable)     │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│ Residual:           │            │ Residual:           │
│   image + adapted   │            │   text + adapted    │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
┌──────────┴──────────┐            ┌──────────┴──────────┐
│   Image Embedding   │            │   Text Embedding    │
└──────────┬──────────┘            └──────────┬──────────┘
           └────────────────┬─────────────────┘
                            │
           Cosine Similarity / Classification
In the architecture above, the Adapter MLP and the Residual Connection are the core of this work!!
Adapter MLP (one each for image and text!!)
The adapter MLP is:
- Two linear layers with a ReLU non-linearity in between,
- Structure: Linear → ReLU → Linear
- A bottleneck structure: it squeezes down to an intermediate dimension and then expands back!!
Residual Connection
When training in a few-shot setting, the training data is extremely scarce, so the model tends to overfit to it!
As a remedy for this overfitting, a residual connection was applied!
The key idea is: "Blend the newly learned representation with the original, well-trained CLIP representation, controlling the ratio."
- (CLIP image/text embedding passed through the adapter) × α
- (original CLIP image/text embedding) × (1 - α)
so the learned result and CLIP's original result are mixed in the right proportion!
Performance Experiments!!
CLIP-Adapter Experimental Setup!
- Datasets used
CLIP-Adapter was evaluated on a total of 11 image classification datasets:
- ImageNet
- StanfordCars
- UCF101
- Caltech101
- Flowers102
- SUN397
- DTD
- EuroSAT
- FGVCAircraft
- OxfordPets
- Food101
For each dataset, fine-tuning is performed in 1-, 2-, 4-, 8-, and 16-shot settings, and performance is measured on the full test set.
All experiments run on a single NVIDIA A100 GPU, and each experiment is repeated 3 times to report the average accuracy!!
- Implementation details
  - Basic setup: only the image features are fine-tuned (visual adapter); the text branch is frozen
  - Hyperparameters:
    - Batch size: 32
    - Learning rate: 1e-5
    - The residual ratios α, β are chosen per dataset via grid search
  - Backbone:
    - Visual encoder: ResNet-50
    - Text encoder: 12-layer Transformer
  - Adapter hidden embedding: 256 for both the visual and text adapters (1/4 of the original embedding)
  - Prompt input:
    - Unlike CoOp, fixed natural-language prompts are used, e.g. "a photo of a {class}"
    - For fine-grained classification, a domain keyword is included, e.g. "a centered satellite photo of {class}"
Analysis of the CLIP-Adapter Results!!
- Main experiments
  - To compare performance, CLIP-Adapter was evaluated against the following three main baselines!
  - Zero-shot CLIP: the CLIP model as-is, prompted with a photo of a {class}
  - Linear probe CLIP: CLIP's image encoder is frozen and only a small linear classifier on top of it is trained.
  - CoOp (Context Optimization): learnable V1, V2 tokens are added to the text prompt and trained
In the end, CLIP-Adapter showed strong performance!!
As the results show, it reached high accuracy with short training time, few parameters, low GPU memory, and fast speed!
Not only that, it also did well when trained on very little data (few-shot)!!
- Where does the adapter go!?
They also compared attaching the adapter to the image only, to the text only, and to both the image and the text!!
In the end, attaching it to the image only worked best!!
They also tested attaching it to the front or the middle of CLIP's 12 Transformer layers,
and, as understood so far, attaching it at the very end of CLIP, i.e. after the 12th layer (CLIP has 12 layers), was the most effective!!
- What about the residual coefficient?!
They also evaluated the coefficient of the residual connection used to prevent overfitting!!
a. For fine-grained, specialized-domain datasets, the optimal α is usually around 0.6 to 0.8!
b. For broad, generic image datasets like Caltech-101 or ImageNet, the optimal α was around 0.2!
Final Thoughts
This is the second PEFT (Parameter-Efficient Fine-Tuning) technique I have studied, after LoRA!!
Not only is the idea fresh, the performance is impressive too!
I will remember this approach and try it out in many places!!
+ "Adapter" used to make me think only of a power-plug adapter, but from now on this CLIP-Adapter is what will stick in my memory! :)