
๐Ÿ“Understanding BLIP - BLIP ์•Œ์•„๋ณด๊ธฐ?!!

๐Ÿ“Understanding BLIP - BLIP ์•Œ์•„๋ณด๊ธฐ?!!

🧠 Understanding BLIP

๐Ÿ” It learned even from messy web data, and now it knows how to describe images all on its own!

(Figure: intro comic)

Paper: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Conference: ICML 2022 (Salesforce Research)
Code: salesforce/BLIP


💡 What is BLIP?

Unlike CLIP by OpenAI,
BLIP supports generation-based multimodal tasks, like interactive image descriptions!
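
To get a feel for what "generation-based" means in practice, here is a minimal captioning sketch using the Hugging Face `transformers` port of BLIP; the checkpoint name and image URL below are just examples you can swap out.

```python
# Minimal image-captioning demo with the Hugging Face port of BLIP.
# Checkpoint name and image URL are examples, not the only options.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))  # a short caption for the image
```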


1. 🧠 Bidirectional Vision-Language Learning

  • Most models focus on either understanding or generation
  • BLIP flexibly supports both directions!

2. 🧹 Bootstrapping Web Data

Think of bootstrapping here as an iterative self-improvement process!

  • To clean noisy web-collected image-text pairs:
    • Captioner: generates synthetic captions
    • Filter: removes noisy examples
  • 👉 This enables construction of a much more reliable dataset

3. 🚀 Zero-shot Transferability

  • Even without fine-tuning, BLIP performs well on video-language tasks!
  • It shows strong generalization ability and versatility

🧠 Why Did BLIP Emerge?


โŒ Limitations of Previous Models

  • Most VLP (Vision-Language Pretraining) models were split:
    • Understanding tasks (e.g., image-text retrieval)
    • Generation tasks (e.g., captioning)
  • Encoder-only (e.g., CLIP) or Encoder-Decoder models could not handle both well

๐ŸŒ Poor Quality of Web Data

  • Most VLP models were trained on noisy web-collected pairs
  • Simple rule-based filtering couldn't sufficiently clean the data
  • Scaling up the data helped performance, but low text quality hindered fine-grained learning

📘 Limitations of Traditional Data Augmentation

  • In vision: augmentations like cropping, rotation are common
  • In NLP: it's much harder to augment text
  • Existing VLP models used little to no textual augmentation, relying heavily on low-quality alt-text

BLIP overcomes this by generating its own captions with its image-grounded text decoder (the Captioner),
producing meaningful synthetic data rather than noise!


🚀 BLIP Architecture & Learning


๐Ÿ–‡๏ธ BLIP Model Components

BLIP has three key components for understanding, aligning, and generating across image and text.
These are collectively called the MED (Multimodal mixture of Encoder-Decoder) architecture.

(Figure: MED model structure)

✅ Summary of Training Objectives by Module

| Category | 🟦 ITC (Contrastive) | 🟩 ITM (Matching) | 🟨 LM (Language Modeling) |
|---|---|---|---|
| 🌐 Goal | Align visual & text embeddings | Determine if an image-text pair matches | Generate text from an image |
| 🧠 Encoding | Encode image & text separately | Inject image into text via cross-attention | Decode text from the image with a decoder |
| ⚙️ Mode | Unimodal Encoder | Image-Grounded Text Encoder | Image-Grounded Text Decoder |
| 🔍 Loss | Contrastive loss (pull/push pairs) | Binary classification (match or not) | Cross-entropy, autoregressive generation |
| 🎯 Strength | Well-structured latent space for retrieval | Fine-grained alignment of multimodal input | Fluent and informative text generation |
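
Putting the table together: during pre-training, every image-text batch passes through all three heads and the three losses are jointly optimized. Below is a conceptual sketch of one such training step; `med`, its loss methods, and the batch layout are placeholders rather than the actual BLIP code.

```python
def med_pretraining_step(batch, med, optimizer):
    """One MED pre-training step: ITC + ITM + LM computed on the same batch (conceptual sketch)."""
    images, texts = batch["images"], batch["texts"]

    loss_itc = med.itc_loss(images, texts)   # align unimodal image/text embeddings
    loss_itm = med.itm_loss(images, texts)   # classify matched vs. mismatched pairs
    loss_lm  = med.lm_loss(images, texts)    # autoregressively generate the caption

    loss = loss_itc + loss_itm + loss_lm     # jointly optimize the three objectives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```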

🧩 1. Unimodal Encoder (ITC)

Separately encodes image and text
Corresponds to ITC in the figure above

  • Image encoder: ViT-based (patch + [CLS])
  • Text encoder: BERT-like, with [CLS]
  • Use: Retrieval and similarity-based understanding
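
Conceptually, ITC is the same InfoNCE-style contrastive objective popularized by CLIP and ALBEF: embed images and texts separately, then pull matched pairs together and push mismatched pairs apart. A minimal sketch (omitting details such as the momentum encoder and soft labels used in the paper) might look like this:

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs (sketch)."""
    # image_feats, text_feats: (batch, dim), e.g. the [CLS] outputs of each unimodal encoder
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal entries are positives

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```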

🧩 2. Image-Grounded Text Encoder (ITM)

Injects image into text encoding via Cross-Attention
Corresponds to ITM in the diagram

  • Architecture: Transformer with Self-Attn → Cross-Attn → FFN
  • Special [Encode] token added for multimodal representation
  • Use: Tasks requiring fine-grained image-text matching
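
A rough sketch of what one image-grounded text-encoder block plus the ITM head could look like in PyTorch; dimensions, layer ordering, and names here are illustrative rather than the exact BLIP implementation.

```python
import torch.nn as nn

class ImageGroundedBlock(nn.Module):
    """One text-encoder block: bidirectional self-attention -> cross-attention to image -> FFN."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, image_embeds):
        # text: (batch, seq, dim) token states; image_embeds: (batch, patches, dim) from the ViT
        h = self.n1(text)
        text = text + self.self_attn(h, h, h)[0]
        # image patches enter as keys/values here: this is what "grounds" the text in the image
        text = text + self.cross_attn(self.n2(text), image_embeds, image_embeds)[0]
        return text + self.ffn(self.n3(text))

# ITM head: a binary match / no-match classifier applied to the [Encode] token's final state
itm_head = nn.Linear(768, 2)
```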

🧩 3. Image-Grounded Text Decoder (LM)

For sequential generation using Causal Self-Attention
Corresponds to LM in the diagram

  • Left-to-right decoding only
  • Starts with [Decode] token, ends with [EOS]
  • Use: Captioning, VQA, generation tasks
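
The decoder differs from the ITM encoder mainly in its self-attention mask: each token may only attend to earlier tokens, so captions are generated left to right, starting from [Decode] and stopping at [EOS]. A tiny illustrative sketch; the `decoder` callable and the token IDs are placeholders, not BLIP's real interface.

```python
import torch

def causal_mask(seq_len):
    """Attention mask where position i cannot see positions j > i (left-to-right only)."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

@torch.no_grad()
def greedy_caption(decoder, image_embeds, decode_id, eos_id, max_len=30):
    """Greedy decoding: start from the [Decode] token, stop at [EOS] (illustrative only)."""
    tokens = [decode_id]
    for _ in range(max_len):
        # the decoder applies the causal mask internally and cross-attends to image_embeds
        logits = decoder(torch.tensor([tokens]), image_embeds)  # (1, len(tokens), vocab)
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```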

🔄 Parameter Sharing Across Modules

SA must be separate due to directionality,
but Embedding, Cross-Attn, and FFN are shared!

| Layer Type | Shared? | Notes |
|---|---|---|
| Self-Attention (SA) | ❌ No | Bidirectional for the encoder vs. causal for the decoder |
| Embedding Layer | ✅ Yes | Word-to-vector layer is shared |
| Cross-Attention (CA) | ✅ Yes | Connects image and text |
| Feed Forward Network | ✅ Yes | Post-attention computations |

As shown below, sharing everything except SA gives the best trade-off: fewer parameters and lower training cost while keeping performance.
(Figure: parameter-sharing ablation)
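
In code, this kind of sharing simply means the encoder and decoder hold references to the same module objects for everything except self-attention, which is also where the parameter and cost savings come from. A schematic sketch; the module sizes and vocabulary size are assumptions, not BLIP's exact configuration.

```python
import torch.nn as nn

def build_text_encoder_and_decoder(dim=768, heads=12, vocab=30522):
    """Encoder and decoder reuse the same embedding, cross-attention, and FFN; only SA differs."""
    shared_embed = nn.Embedding(vocab, dim)                              # shared word embeddings
    shared_cross = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared cross-attention
    shared_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    encoder_sa = nn.MultiheadAttention(dim, heads, batch_first=True)     # bidirectional SA (ITM)
    decoder_sa = nn.MultiheadAttention(dim, heads, batch_first=True)     # causal SA (LM)

    encoder = {"embed": shared_embed, "sa": encoder_sa, "ca": shared_cross, "ffn": shared_ffn}
    decoder = {"embed": shared_embed, "sa": decoder_sa, "ca": shared_cross, "ffn": shared_ffn}
    return encoder, decoder
```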


🧬 CapFilt: Self-Filtering and Captioning of Web Data

(Figure: CapFilt pipeline)

  • Raw Data: Human-annotated image-text pairs (I_h, T_h)
  • Web Data: Noisy pairs collected from the internet (I_w, T_w)
  • Initial Training: Pre-train the MED model on both (I_h, T_h) and (I_w, T_w)

Then:

  1. Train the Filter
    • Fine-tune the Image-grounded Text Encoder with the ITC/ITM losses on (I_h, T_h)
    • Apply it to (I_w, T_w) to remove noise → (I_w, T_w′)
  2. Train the Captioner
    • Fine-tune the Image-grounded Text Decoder with the LM loss on (I_h, T_h)
    • Generate synthetic captions for I_w → (I_w, T_s)
  3. Re-filter the synthetic captions
    • Apply the Filter again to (I_w, T_s), removing ITM-mismatched captions
    • Final result: (I_w, T_s′)

Noisy web text (red) is removed, and new synthetic captions (green) are created!
(Figure: filtering example)
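
Putting the three steps together, the CapFilt loop looks roughly like the sketch below; `filter_model`, `captioner`, and the match-probability threshold are placeholders standing in for the fine-tuned ITM encoder and LM decoder.

```python
def capfilt(web_pairs, filter_model, captioner, threshold=0.5):
    """Bootstrap a cleaner dataset from noisy web image-text pairs (conceptual sketch)."""
    filtered_web, filtered_synth = [], []

    for image, web_text in web_pairs:
        # 1) Keep the original web text only if the ITM filter says it matches the image
        if filter_model.match_prob(image, web_text) >= threshold:
            filtered_web.append((image, web_text))          # (I_w, T_w′)

        # 2) Let the captioner write a synthetic caption for the same image
        synthetic = captioner.generate(image)                # T_s

        # 3) Re-filter the synthetic caption with the same ITM filter
        if filter_model.match_prob(image, synthetic) >= threshold:
            filtered_synth.append((image, synthetic))        # (I_w, T_s′)

    return filtered_web, filtered_synth
```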


✅ Final Datasets Used to Re-train MED

  • A. Human-annotated pairs (I_h, T_h)
  • B. Filtered web-text pairs (I_w, T_w′)
  • C. Filtered synthetic captions (I_w, T_s′)

These three are merged into a new training dataset for retraining MED!
This forms a bootstrapping loop where both the data and the model improve together 💪

And indeed, as shown below, using both the Captioner and the Filter improves performance!
(Figure: CapFilt ablation results)


📦 Initial MED Pretraining
├── Human-labeled pairs: (I_h, T_h)
├── Web-collected pairs: (I_w, T_w)
└── ➤ Initial MED training (Losses: ITC + ITM + LM)

🔧 CapFilt: Data Filtering and Module Refinement
├── Filter Training
│   ├── Module: Image-grounded Text Encoder
│   ├── Data: (I_h, T_h)
│   └── Loss: ITC / ITM
│       └── Result: Filtered web text (I_w, T_w′)
│
├── Captioner Training
│   ├── Module: Image-grounded Text Decoder
│   ├── Data: (I_h, T_h)
│   └── Loss: LM
│       └── Generate synthetic text (I_w, T_s)
│                                └→ Filter again → (I_w, T_s′)

🧠 Final Re-Pretraining of MED
├── Human annotations: (I_h, T_h)
├── Filtered web text: (I_w, T_w′)
├── Filtered synthetic captions: (I_w, T_s′)
└── ➤ Train the new MED on the clean dataset D′

🚀 Results from BLIP

(Figure: results analysis)

BLIP outperforms prior models across both retrieval and captioning tasks!


🧠 Final Thoughts

While today's LLMs easily interpret and generate across modalities,
in 2022, BLIP was a milestone that helped machines "speak based on what they see."

It went beyond classification and embedding,
introducing the ability to generate meaning from vision using language.

Its self-bootstrapping of both data and training echoes self-supervised approaches such as DINO in vision representation learning,
and likely influenced later multimodal work.

To understand modern multimodal AI,
BLIP is truly a foundational model worth studying! 😊



