
🧠 Understanding CLIP-Adapter!

πŸ” Easy Fine-Tuning for CLIP with Just One Adapter!

manhwa

Paper: CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Published: IJCV 2024 (Gao, Peng, et al.)
Code: gaopengcuhk/CLIP-Adapter


πŸ“Œ Why CLIP-Adapter?

With the rise of large-scale vision-language models like CLIP,
our ability to understand images and text together has significantly improved.
One of the most exciting aspects is zero-shot classification β€” prediction without labeled data!

However, these models also have some serious limitations.


❓ Problem 1: Prompt Sensitivity

CLIP heavily depends on natural language prompts such as "a photo of a {label}".
Changing just a few words (e.g., "this is a dog" vs. "a photo of a dog") can noticeably change performance.
This forces manual prompt engineering, which is tedious and often ineffective in domain-specific tasks.

πŸ“Œ In short: CLIP is too sensitive to small changes in prompts.
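
To make the prompt sensitivity concrete, here is a minimal zero-shot classification sketch (not from the paper) using the open-source openai/CLIP package; the image path "example.jpg" is a placeholder, and swapping the prompt template is the only difference between the two runs.

```python
import torch
import clip  # the open-source openai/CLIP package
from PIL import Image

# Load a frozen CLIP model (RN50 is also the backbone used later in the paper).
model, preprocess = clip.load("RN50", device="cpu")

classes = ["dog", "cat", "car"]
templates = ["a photo of a {}", "this is a {}"]   # two prompt styles to compare

# "example.jpg" is a hypothetical local image path.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    for template in templates:
        text = clip.tokenize([template.format(c) for c in classes])
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # Cosine-similarity scores; the ranking can shift with the template alone.
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
        print(template, probs.tolist())
```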

To address this, CoOp (Context Optimization) was introduced!

CoOp demonstrated that prompt tuning alone can fine-tune CLIP effectively!

  • 🧠 CoOp replaces natural language prompts with learnable continuous vectors.
    • For example, instead of this is a dog, we input [V1] [V2] [V3] dog.
    • Here, [V1], [V2], [V3] are learned vectors, and the user only inputs the class name like dog.
  • No more manual prompt crafting β€” the model learns the prompt by itself!
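
For intuition, here is a minimal sketch of the CoOp idea (my own simplification, not the official CoOp implementation): the prompt becomes a small set of trainable context vectors that are prepended to each class-name embedding before the frozen text encoder.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Toy CoOp-style prompt: [V1]...[Vm] prepended to the class-name embedding."""

    def __init__(self, n_ctx: int = 3, ctx_dim: int = 512):
        super().__init__()
        # Continuous context vectors, learned by gradient descent instead of hand-written words.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (num_classes, n_tokens, ctx_dim), e.g. the embedded word "dog"
        n_cls = class_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # The concatenated sequence is what the (frozen) CLIP text encoder consumes.
        return torch.cat([ctx, class_embeddings], dim=1)
```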

❗ Problem 2: Tuning Only the Text Side

But CoOp only tunes the prompt β€” that is, the text side of CLIP.
The image encoder remains fixed.

β€œWe’re adapting the language, but still trusting the same image representation?”

This imbalance limits performance, especially in few-shot or domain-specific scenarios.

As shown below, CoOp learns only the [V1], [V2] tokens in the text.
CLIP-Adapter, in contrast, introduces adapters on both the image and text branches!

compareCOOP


πŸ’‘ CLIP-Adapter Architecture!!!

structure

CLIP-Adapter performs fine-tuning at the feature level for both image and text.

            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚   Image Input (x)  β”‚                β”‚   Text Input (t)   β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓                                     ↓
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β”‚   CLIP Image Encoder       β”‚         β”‚   CLIP Text Encoder        β”‚
       β”‚       (frozen)             β”‚         β”‚        (frozen)            β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 ↓                                     ↓
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚  Image Adapter MLP  β”‚               β”‚  Text Adapter MLP   β”‚
     β”‚     (trainable)     β”‚               β”‚     (trainable)     β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓                                       ↓
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚ Residual: image + adapted  β”‚       β”‚ Residual: text + adapted   β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓                                       ↓
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚  Image Embedding   β”‚              β”‚  Text Embedding    β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     ↓
                      Cosine Similarity / Classification

πŸ”§ Adapter MLPs (for image and text)

adapter

The adapter is a 2-layer MLP with ReLU, also called a bottleneck MLP:

  • Structure: Linear β†’ ReLU β†’ Linear
  • It reduces the feature dimension and then expands it back.
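
A minimal PyTorch sketch of such a bottleneck adapter, following the Linear β†’ ReLU β†’ Linear description above (the 1024-dimensional input assumes CLIP RN50 features; the ΒΌ reduction matches the setting reported later):

```python
import torch.nn as nn

def make_adapter(dim: int = 1024, reduction: int = 4) -> nn.Sequential:
    """Bottleneck adapter: squeeze the feature, apply ReLU, expand it back."""
    hidden = dim // reduction  # e.g. 1024 -> 256, matching the 1/4 ratio used in the paper
    return nn.Sequential(
        nn.Linear(dim, hidden),
        nn.ReLU(inplace=True),
        nn.Linear(hidden, dim),
    )
```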

πŸ–‡οΈ Residual Connection

residual

In few-shot learning, models tend to overfit due to limited data.
To solve this, CLIP-Adapter uses residual blending:

β€œBlend new knowledge (adapter output) with original CLIP features.”

The final feature becomes:

  • Ξ± Γ— Adapter Output + (1 - Ξ±) Γ— CLIP Feature

This mixing helps retain the robustness of CLIP while injecting task-specific knowledge.
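
Putting the pieces together, here is a hedged sketch of how the blended image feature could feed the usual cosine-similarity classification (function and variable names are mine, not the official repo's):

```python
import torch

def adapted_logits(image_feat, text_feats, adapter, alpha=0.2, logit_scale=100.0):
    """Blend the adapter output with the frozen CLIP feature, then score the classes."""
    x = alpha * adapter(image_feat) + (1.0 - alpha) * image_feat  # residual blend
    x = x / x.norm(dim=-1, keepdim=True)                          # re-normalize
    t = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return logit_scale * x @ t.t()                                # cosine-similarity logits
```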


πŸ”¬ Performance Experiments

πŸ§ͺ CLIP-Adapter Experimental Setup

Datasets:

  • ImageNet, StanfordCars, UCF101, Caltech101, Flowers102
  • SUN397, DTD, EuroSAT, FGVCAircraft, OxfordPets, Food101

Settings:

  • Few-shot setups: 1, 2, 4, 8, 16-shot
  • Evaluation: average over 3 runs, single A100 GPU

Implementation:

  • Visual adapter only; text frozen
  • Batch size: 32, learning rate: 1e-5
  • Ξ±, Ξ² tuned via grid search
  • Visual backbone: ResNet-50
  • Text encoder: 12-layer Transformer
  • Adapter dim: 256 (ΒΌ of original)
  • Prompt: Fixed natural text (β€œa photo of a {class}”)
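
Under these settings, a few-shot training loop could look roughly like the sketch below (it reuses `make_adapter` and `adapted_logits` from the earlier sketches; `encode_image` stands for the frozen CLIP image encoder, and the AdamW choice is illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def train_adapter(adapter, encode_image, text_feats, loader,
                  alpha=0.2, epochs=20, lr=1e-5):
    """Few-shot training sketch: only the adapter's weights receive gradients.

    encode_image : frozen CLIP image encoder returning image features
    text_feats   : frozen "a photo of a {class}" text features, one row per class
    """
    optimizer = torch.optim.AdamW(adapter.parameters(), lr=lr)  # optimizer choice is illustrative
    for _ in range(epochs):
        for images, labels in loader:            # few-shot batches (batch size 32 in the paper)
            with torch.no_grad():
                feats = encode_image(images)     # no gradients flow into CLIP itself
            logits = adapted_logits(feats, text_feats, adapter, alpha=alpha)
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return adapter
```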

πŸ“ˆ CLIP-Adapter Results

Baselines:

  • Zero-shot CLIP: frozen model + prompt only
  • Linear Probe CLIP: frozen encoder + trainable linear classifier
  • CoOp: learns [V1] [V2] ... tokens in prompt

res_compare

CLIP-Adapter outperforms all baselines in accuracy while staying efficient in training time,
parameter count, and GPU memory, especially in few-shot learning.


πŸ” Where to Put the Adapter?
  • Adapter placement: image only, text only, or both β†’ Best: image-only

adaptersto

  • Insertion layer: ViT-B has 12 layers
    β†’ Best: insert adapter after layer 12 (last layer)

where


πŸ”§ What about Residual Ratio Ξ±?
  • Fine-grained datasets (e.g. EuroSAT, DTD):
    β†’ Best Ξ± β‰ˆ 0.6–0.8
  • Generic datasets (e.g. Caltech101, ImageNet):
    β†’ Best Ξ± β‰ˆ 0.2
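
Since the best Ξ± is clearly dataset-dependent, it is simply searched on held-out data; here is a tiny sketch of such a grid search (the `evaluate` callable, e.g. a hypothetical `val_accuracy`, is an assumed helper returning validation accuracy):

```python
# Pick the residual ratio alpha on a small validation split with a plain grid search.
def search_alpha(evaluate, candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """evaluate(alpha) -> validation accuracy; returns the best-scoring alpha."""
    scores = {alpha: evaluate(alpha) for alpha in candidates}
    return max(scores, key=scores.get)

# e.g. best_alpha = search_alpha(lambda a: val_accuracy(adapter, text_feats, val_loader, alpha=a))
```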

🧠 Final Thoughts

After studying LoRA, this was the second PEFT (Parameter-Efficient Fine-Tuning) method I explored,
and I found CLIP-Adapter both innovative and effective.

I used to think of β€œadapter” as just a power plug β€”
Now, I’ll always remember CLIP-Adapter! πŸ˜„


🧠 Understanding CLIP-Adapter (Korean Version)!

πŸ” Fine-Tune CLIP Easily with Just One Adapter!!

manhwa

Paper: CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Published: IJCV 2024 (Gao, Peng, et al.)
Code: gaopengcuhk/CLIP-Adapter


πŸ“Œ Background: Why Did CLIP-Adapter Appear!?

With the arrival of large-scale vision-language models like CLIP,
our ability to understand images and text together improved dramatically.
Among these abilities, "zero-shot classification" was revolutionary in that inference is possible without any labels.

But even these models had serious limitations.


❓ Problem 1: Prompt Sensitivity

CLIP depends on prompts like "a photo of a {label}".
For example, "a photo of a dog" and "this is a dog" can produce different results.
So a human had to hand-design which prompt gives the best performance (prompt engineering)!!
And in specialized domains, even that prompt engineering was of little use!!

πŸ“Œ In other words, CLIP reacts sensitively even when a single word of the sentence changes.

That is exactly why CoOp (Context Optimization) appeared!

The CoOp study showed that CLIP can be fine-tuned through prompt tuning alone!

  • 🧠 CoOp replaces the prompt sentence with learnable continuous vectors.
    • For example, instead of typing this is a dog, we input [V1] [V2] [V3] dog!
    • Here, [V1] [V2] [V3] are vectors learned during fine-tuning, so in the end the user only needs to type dog!
  • There is no need for a human to design the prompt (prompt-free tuning),
  • because the model learns the prompt itself.

❗ Problem 2: The Limits of Tuning Only the Text Side

However, CoOp fine-tunes only the text prompt part.
The image features are left as they are.

"The text has been adapted, but the image representation is still fixed?"

This imbalance can lead to performance drops, especially in specialized domains or few-shot learning.

This is CoOp's structure!! Only the V1, V2, ... tokens in front of the text prompt are learned!!
Today's CLIP-Adapter, in contrast, has adapters for both the text and the image!?
compareCOOP


πŸ’‘ The CLIP-Adapter Architecture!!!

structure

CLIP-Adapter performs the adjustment directly at the feature level for both images and text.

            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚   Image Input (x)  β”‚                β”‚   Text Input (t)   β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓                                     ↓
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β”‚   CLIP Image Encoder       β”‚         β”‚   CLIP Text Encoder        β”‚
       β”‚       (frozen)             β”‚         β”‚        (frozen)            β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 ↓                                     ↓
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚  Image Adapter MLP  β”‚               β”‚  Text Adapter MLP   β”‚
     β”‚     (trainable)     β”‚               β”‚     (trainable)     β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓                                       ↓
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚ Residual: image + adapted  β”‚       β”‚ Residual: text + adapted   β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓                                       ↓
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚  Image Embedding   β”‚              β”‚  Text Embedding    β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     ↓
                      Cosine Similarity / Classification

κ²°κ΅­ μœ„μ˜ κ΅¬μ‘°μ—μ„œ Adapter MLP 와 Residual Connection 뢀뢄이 이번 μ—°κ΅¬μ˜ ν•΅μ‹¬μΈλ°μš”!!

πŸ”§ Adapter MLP (Image, Text에 각각!!)

adapter

Adapter λΆ€λΆ„μ˜ MLPλŠ”!!

  • 두 개의 μ„ ν˜• 계측 + ReLU λΉ„μ„ ν˜• ν•¨μˆ˜κ΅¬μ‘°λ‘œμ„œ,
  • ꡬ쑰: Linear β†’ ReLU β†’ Linear
  • Bottleneck ꡬ쑰둜 쀑간 μ°¨μ›μœΌλ‘œ μΆ•μ†Œν–ˆλ‹€κ°€ λ‹€μ‹œ ν™•μž₯ν•˜κ²Œ λ©λ‹ˆλ‹€!!
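
As a rough sanity check (my own arithmetic, assuming the 1024-dimensional CLIP RN50 embedding and the ΒΌ bottleneck, biases ignored), one such adapter adds only about half a million trainable parameters:

```python
# Rough parameter count for one bottleneck adapter (biases omitted for simplicity).
dim, hidden = 1024, 256      # assumed CLIP RN50 embedding size and 1/4 bottleneck
down = dim * hidden          # 1024 * 256 = 262,144
up = hidden * dim            # 256 * 1024 = 262,144
print(down + up)             # ~0.52M trainable parameters vs. ~100M in the frozen CLIP model
```
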
πŸ–‡οΈ Residual Connection

residual

When the model is trained in a few-shot setting,
the training data is so scarce that the model tends to overfit it!
As a remedy for this overfitting, CLIP-Adapter applies a Residual Connection!

The core idea is to "mix the newly learned representation with the original, well-trained CLIP representation at an adjustable ratio":

  1. (the CLIP embedding of the image or text passed through the adapter) Γ— Ξ±
  2. (the original CLIP embedding of the image or text) Γ— (1 - Ξ±)

are added together, blending the newly learned result with CLIP's original output in just the right proportion!
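
Written out as equations (my notation, following the Ξ± blend described above; in the paper the text branch has its own ratio, usually written Ξ²):

```latex
% Visual branch: blend the adapter output with the frozen CLIP image feature f
f^{\star} = \alpha \, A_v(f) + (1 - \alpha)\, f ,
\qquad A_v(f) = W_2 \,\mathrm{ReLU}(W_1 f)

% Prediction: softmax over cosine similarities with each class text feature w_i
p(y = i \mid x) =
  \frac{\exp\big(\cos(f^{\star}, w_i)/\tau\big)}
       {\sum_{j=1}^{K} \exp\big(\cos(f^{\star}, w_j)/\tau\big)}
```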

πŸ”¬ Performance Experiments!!

CLIP-Adapter experimental setup!
  1. πŸ“Š Datasets used

CLIP-Adapter was evaluated on a total of 11 image classification datasets:

  • ImageNet
  • StanfordCars
  • UCF101
  • Caltech101
  • Flowers102
  • SUN397
  • DTD
  • EuroSAT
  • FGVCAircraft
  • OxfordPets
  • Food101

For each dataset, fine-tuning is performed in 1, 2, 4, 8, and 16-shot settings,
and performance is measured on the full test set.
All experiments run on a single NVIDIA A100 GPU,
and each experiment is repeated 3 times to report the average accuracy!!

  2. βš™οΈ Implementation details
  • Basic setup: only the image features are fine-tuned (visual adapter); the text branch is kept frozen

  • Hyperparameters:
    • Batch size: 32
    • Learning rate: 1e-5
    • The residual ratios Ξ± and Ξ² are chosen per dataset via grid search
  • Backbone:
    • Visual encoder: ResNet-50
    • Text encoder: 12-layer Transformer
  • Adapter hidden embedding: 256 for both the visual and text adapters (1/4 of the original embedding)

  • Prompt input:
    • Unlike CoOp, fixed natural-language prompts are used
      e.g. "a photo of a {class}"
    • For fine-grained classification, a domain keyword is included
      e.g. "a centered satellite photo of {class}"
CLIP-Adapter results analysis!!
  1. Main experiments
    • To compare performance, CLIP-Adapter was tested against the following 3 main baselines!
  • Zero-shot CLIP: the CLIP model as-is, prompted with a photo of a {class}
  • Linear probe CLIP: CLIP's image encoder is frozen and only a shallow linear classifier on top is trained.
  • CoOp (Context Optimization): learnable V1 V2 ... tokens are added to the text prompt and trained

res_compare

In the end, CLIP-Adapter delivered the best results!!
As the image above shows, it reached high accuracy with short training, few parameters, low GPU memory, and fast speed!
Not only that, it also performed well when trained on tiny datasets (few-shot)!!

  1. μ–΄λŽν„°λŠ” 어디에!?

μΆ”κ°€λ‘œ μ–΄λŽν„°λ₯Ό μ΄λ―Έμ§€λ§Œ, ν…μŠ€νŠΈλ§Œ, μ΄λ―Έμ§€λž‘ ν…μŠ€νŠΈ λͺ¨λ‘ 에 λΆ™μ΄λŠ” 비ꡐ도 ν•΄λ³΄μ•˜κ³ !!

adaptersto

κ²°κ΅­ μ΄λ―Έμ§€λ§Œ ν•˜λŠ”κ²Œ 제일 μ’‹μ•˜λ‹€κ³ ν•©λ‹ˆλ‹€!!

where

λ˜ν•œ 12개 Transformerλ ˆμ΄μ–΄λ‘œ κ΅¬μ„±λœ CLIP 의 μ•žλΆ€λΆ„, 쀑간뢀뢄 등에 λΆ™μ΄λŠ”κ²ƒλ„ ν…ŒμŠ€νŠΈν•΄λ³΄μ•˜κ³ ,
μ§€κΈˆκΉŒμ§€ μ΄ν•΄ν•œκ²ƒ 처럼 CLIP의 맨 λ’·λΆ€λΆ„,
즉 12번쨰 λ ˆμ΄μ–΄(CLIP이 12개 Layer둜 ꡬ성) 뒀에 λΆ™μ΄λŠ” 것이 κ°€μž₯ 효율이 μ’‹μ•˜μŠ΅λ‹ˆλ‹€!!

  3. What about the residual coefficient?!
    • The coefficient of the Residual Connection, introduced to prevent overfitting, was also evaluated!!

a. For fine-grained, specialized-domain datasets, the best Ξ± was usually around 0.6 to 0.8!

b. For broad, generic image datasets such as Caltech-101 or ImageNet, the best Ξ± was around 0.2!


🧠 Closing Thoughts

This is the second PEFT (Parameter-Efficient Fine-Tuning) technique I've studied, after LoRA!!
Not only is the idea clever, the performance is impressive too!
I'll keep this method in mind and try it in many places!!

+ The word "adapter" used to make me think only of power-plug adapters, but from now on CLIP-Adapter is what I'll remember! :)


This post is licensed under CC BY 4.0 by the author.