
๐Ÿ“Understanding BLIP - BLIP ์•Œ์•„๋ณด๊ธฐ?!!

๐Ÿ“Understanding BLIP - BLIP ์•Œ์•„๋ณด๊ธฐ?!!

🧠 Understanding BLIP

๐Ÿ” It learned even from messy web data, and now it knows how to describe images all on its own!

(Figure: intro comic)

Paper: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Conference: ICML 2022 (Salesforce Research)
Code: salesforce/BLIP


💡 What is BLIP?

Unlike CLIP by OpenAI,
BLIP supports generation-based multimodal tasks, like interactive image descriptions!
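
To get a feel for what "generation-based" means in practice, here is a minimal captioning sketch using the Hugging Face `transformers` port of BLIP; the checkpoint name and image URL below are just examples you can swap out.

```python
# Minimal image-captioning demo with the Hugging Face port of BLIP.
# Checkpoint name and image URL are examples, not the only options.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))  # a short caption for the image
```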


1. 🧠 Bidirectional Vision-Language Learning

  • Most models focus on either understanding or generation
  • BLIP flexibly supports both directions!

2. 🧹 Bootstrapping Web Data

Think of bootstrapping here as an iterative self-improvement process!

  • To clean noisy web-collected image-text pairs:
    • Captioner: generates synthetic captions
    • Filter: removes noisy examples
  • 👉 This enables construction of a much more reliable dataset

3. 🚀 Zero-shot Transferability

  • Even without fine-tuning, BLIP performs well on video-language tasks!
  • It shows strong generalization ability and versatility

🧠 Why Did BLIP Emerge?


โŒ Limitations of Previous Models

  • Most VLP (Vision-Language Pretraining) models were split:
    • Understanding tasks (e.g., image-text retrieval)
    • Generation tasks (e.g., captioning)
  • Encoder-only (e.g., CLIP) or Encoder-Decoder models could not handle both well

๐ŸŒ Poor Quality of Web Data

  • Most VLP models were trained on noisy web-collected pairs
  • Simple rule-based filtering couldn't sufficiently clean the data
  • Scaling up the data helped performance, but low text quality hindered fine-grained learning

📘 Limitations of Traditional Data Augmentation

  • In vision: augmentations like cropping, rotation are common
  • In NLP: it's much harder to augment text
  • Existing VLP models used little to no textual augmentation, relying heavily on low-quality alt-text

BLIP overcomes this by generating its own captions with its image-grounded text decoder (the Captioner),
producing meaningful synthetic data rather than noise!


🚀 BLIP Architecture & Learning


๐Ÿ–‡๏ธ BLIP Model Components

BLIP has three key components for understanding, aligning, and generating across image and text.
These are collectively called the MED (Multimodal mixture of Encoder-Decoder) architecture.

(Figure: MED model structure)

✅ Summary of Training Objectives by Module

| Category | 🟦 ITC (Contrastive) | 🟩 ITM (Matching) | 🟨 LM (Language Modeling) |
|---|---|---|---|
| 🌐 Goal | Align visual & text embeddings | Determine if an image-text pair matches | Generate text from an image |
| 🧠 Encoding | Encode image & text separately | Inject image into text via cross-attention | Decode text from the image with a decoder |
| ⚙️ Mode | Unimodal Encoder | Image-Grounded Text Encoder | Image-Grounded Text Decoder |
| 🔍 Loss | Contrastive loss (pull/push pairs) | Binary classification (match or not) | Cross-entropy, autoregressive generation |
| 🎯 Strength | Well-structured latent space for retrieval | Fine-grained alignment of multimodal input | Fluent and informative text generation |
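
Putting the table together: during pre-training, every image-text batch passes through all three heads and the three losses are jointly optimized. Below is a conceptual sketch of one such training step; `med`, its loss methods, and the batch layout are placeholders rather than the actual BLIP code.

```python
def med_pretraining_step(batch, med, optimizer):
    """One MED pre-training step: ITC + ITM + LM computed on the same batch (conceptual sketch)."""
    images, texts = batch["images"], batch["texts"]

    loss_itc = med.itc_loss(images, texts)   # align unimodal image/text embeddings
    loss_itm = med.itm_loss(images, texts)   # classify matched vs. mismatched pairs
    loss_lm  = med.lm_loss(images, texts)    # autoregressively generate the caption

    loss = loss_itc + loss_itm + loss_lm     # jointly optimize the three objectives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```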

🧩 1. Unimodal Encoder (ITC)

Separately encodes image and text
Corresponds to ITC in the figure above

  • Image encoder: ViT-based (patch + [CLS])
  • Text encoder: BERT-like, with [CLS]
  • Use: Retrieval and similarity-based understanding
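
Conceptually, ITC is the same InfoNCE-style contrastive objective popularized by CLIP and ALBEF: embed images and texts separately, then pull matched pairs together and push mismatched pairs apart. A minimal sketch (omitting details such as the momentum encoder and soft labels used in the paper) might look like this:

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs (sketch)."""
    # image_feats, text_feats: (batch, dim), e.g. the [CLS] outputs of each unimodal encoder
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal entries are positives

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```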

🧩 2. Image-Grounded Text Encoder (ITM)

Injects image into text encoding via Cross-Attention
Corresponds to ITM in the diagram

  • Architecture: Transformer with Self-Attn → Cross-Attn → FFN
  • Special [Encode] token added for multimodal representation
  • Use: Tasks requiring fine-grained image-text matching
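
A rough sketch of what one image-grounded text-encoder block plus the ITM head could look like in PyTorch; dimensions, layer ordering, and names here are illustrative rather than the exact BLIP implementation.

```python
import torch.nn as nn

class ImageGroundedBlock(nn.Module):
    """One text-encoder block: bidirectional self-attention -> cross-attention to image -> FFN."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, image_embeds):
        # text: (batch, seq, dim) token states; image_embeds: (batch, patches, dim) from the ViT
        h = self.n1(text)
        text = text + self.self_attn(h, h, h)[0]
        # image patches enter as keys/values here: this is what "grounds" the text in the image
        text = text + self.cross_attn(self.n2(text), image_embeds, image_embeds)[0]
        return text + self.ffn(self.n3(text))

# ITM head: a binary match / no-match classifier applied to the [Encode] token's final state
itm_head = nn.Linear(768, 2)
```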

🧩 3. Image-Grounded Text Decoder (LM)

For sequential generation using Causal Self-Attention
Corresponds to LM in the diagram

  • Left-to-right decoding only
  • Starts with [Decode] token, ends with [EOS]
  • Use: Captioning, VQA, generation tasks
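
The decoder differs from the ITM encoder mainly in its self-attention mask: each token may only attend to earlier tokens, so captions are generated left to right, starting from [Decode] and stopping at [EOS]. A tiny illustrative sketch; the `decoder` callable and the token IDs are placeholders, not BLIP's real interface.

```python
import torch

def causal_mask(seq_len):
    """Attention mask where position i cannot see positions j > i (left-to-right only)."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

@torch.no_grad()
def greedy_caption(decoder, image_embeds, decode_id, eos_id, max_len=30):
    """Greedy decoding: start from the [Decode] token, stop at [EOS] (illustrative only)."""
    tokens = [decode_id]
    for _ in range(max_len):
        # the decoder applies the causal mask internally and cross-attends to image_embeds
        logits = decoder(torch.tensor([tokens]), image_embeds)  # (1, len(tokens), vocab)
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```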

🔄 Parameter Sharing Across Modules

SA must be separate due to directionality,
but Embedding, Cross-Attn, and FFN are shared!

| Layer Type | Shared? | Notes |
|---|---|---|
| Self-Attention (SA) | ❌ No | Bidirectional for the encoder vs. causal for the decoder |
| Embedding Layer | ✅ Yes | Word-to-vector layer is shared |
| Cross-Attention (CA) | ✅ Yes | Connects image and text |
| Feed Forward Network | ✅ Yes | Post-attention computations |

As shown below, sharing everything except SA gives the best trade-off: fewer parameters and lower training cost while keeping performance.
(Figure: parameter-sharing ablation)
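
In code, this kind of sharing simply means the encoder and decoder hold references to the same module objects for everything except self-attention, which is also where the parameter and cost savings come from. A schematic sketch; the module sizes and vocabulary size are assumptions, not BLIP's exact configuration.

```python
import torch.nn as nn

def build_text_encoder_and_decoder(dim=768, heads=12, vocab=30522):
    """Encoder and decoder reuse the same embedding, cross-attention, and FFN; only SA differs."""
    shared_embed = nn.Embedding(vocab, dim)                              # shared word embeddings
    shared_cross = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared cross-attention
    shared_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    encoder_sa = nn.MultiheadAttention(dim, heads, batch_first=True)     # bidirectional SA (ITM)
    decoder_sa = nn.MultiheadAttention(dim, heads, batch_first=True)     # causal SA (LM)

    encoder = {"embed": shared_embed, "sa": encoder_sa, "ca": shared_cross, "ffn": shared_ffn}
    decoder = {"embed": shared_embed, "sa": decoder_sa, "ca": shared_cross, "ffn": shared_ffn}
    return encoder, decoder
```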


🧬 CapFilt: Self-Filtering and Captioning of Web Data

(Figure: CapFilt pipeline)

  • Raw Data: Human-annotated image-text pairs (I_h, T_h)
  • Web Data: Noisy pairs collected from the internet (I_w, T_w)
  • Initial Training: Pre-train the MED model on both (I_h, T_h) and (I_w, T_w)

Then:

  1. Train the Filter
    • Fine-tune the Image-grounded Text Encoder with the ITC/ITM losses on (I_h, T_h)
    • Apply it to (I_w, T_w) to remove noise → (I_w, T_w′)
  2. Train the Captioner
    • Fine-tune the Image-grounded Text Decoder with the LM loss on (I_h, T_h)
    • Generate synthetic captions for I_w → (I_w, T_s)
  3. Re-filter the synthetic captions
    • Apply the Filter again to (I_w, T_s), removing ITM-mismatched captions
    • Final result: (I_w, T_s′)

Noisy web text (red) is removed, and new synthetic captions (green) are created!
(Figure: filtering example)
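
Putting the three steps together, the CapFilt loop looks roughly like the sketch below; `filter_model`, `captioner`, and the match-probability threshold are placeholders standing in for the fine-tuned ITM encoder and LM decoder.

```python
def capfilt(web_pairs, filter_model, captioner, threshold=0.5):
    """Bootstrap a cleaner dataset from noisy web image-text pairs (conceptual sketch)."""
    filtered_web, filtered_synth = [], []

    for image, web_text in web_pairs:
        # 1) Keep the original web text only if the ITM filter says it matches the image
        if filter_model.match_prob(image, web_text) >= threshold:
            filtered_web.append((image, web_text))          # (I_w, T_w′)

        # 2) Let the captioner write a synthetic caption for the same image
        synthetic = captioner.generate(image)                # T_s

        # 3) Re-filter the synthetic caption with the same ITM filter
        if filter_model.match_prob(image, synthetic) >= threshold:
            filtered_synth.append((image, synthetic))        # (I_w, T_s′)

    return filtered_web, filtered_synth
```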


✅ Final Datasets Used to Re-train MED

  • A. Human-annotated pairs (I_h, T_h)
  • B. Filtered web-text pairs (I_w, T_w′)
  • C. Filtered synthetic captions (I_w, T_s′)

These three are merged into a new training dataset for retraining MED!
This forms a bootstrapping loop where both the data and the model improve together 💪

And indeed, as shown below, using both the Captioner and the Filter improves performance!
(Figure: CapFilt ablation results)


📦 Initial MED Pretraining
├── Human-labeled pairs: (I_h, T_h)
├── Web-collected pairs: (I_w, T_w)
└── ➤ Initial MED training (Losses: ITC + ITM + LM)

🔧 CapFilt: Data Filtering and Module Refinement
├── Filter Training
│   ├── Module: Image-grounded Text Encoder
│   ├── Data: (I_h, T_h)
│   └── Loss: ITC / ITM
│       └── Result: Filtered web text (I_w, T_w′)
│
├── Captioner Training
│   ├── Module: Image-grounded Text Decoder
│   ├── Data: (I_h, T_h)
│   └── Loss: LM
│       └── Generate synthetic text (I_w, T_s)
│                                └→ Filter again → (I_w, T_s′)

🧠 Final Re-Pretraining of MED
├── Human annotations: (I_h, T_h)
├── Filtered web text: (I_w, T_w′)
├── Filtered synthetic captions: (I_w, T_s′)
└── ➤ Train the new MED on the clean dataset D′

🚀 Results from BLIP

(Figure: results analysis)

BLIP outperforms prior models across both retrieval and captioning tasks!


🧠 Final Thoughts

While today's LLMs easily interpret and generate across modalities,
in 2022, BLIP was a milestone that helped machines "speak based on what they see."

It went beyond classification and embedding,
introducing the ability to generate meaning from vision using language.

Its self-bootstrapping of both data and training echoes self-supervised approaches such as DINO in vision representation learning,
and likely influenced later multimodal work.

To understand modern multimodal AI,
BLIP is truly a foundational model worth studying! 😊



