📝 Understanding CLIP - CLIP 모델 이해하기

Posted Apr 6, 2025

By DrFirst

17 min read

(English-ver) Understanding CLIP - A Beginner-Friendly Guide to Contrastive Language–Image Pretraining

Hello! 😊
Today, let’s dive into CLIP (Contrastive Language–Image Pre-training), a powerful multimodal model released by OpenAI.
While multimodal models are now the norm, it’s amazing to realize this research came out in 2021—before ChatGPT! Let’s explore what makes CLIP so groundbreaking.

An Easy Way to Understand CLIP

John has 1 kg of gold.
Alex has 1 Bitcoin.
How can we compare their value fairly?
A model like CLIP helps transform these different assets into the same scale, like dollars ($), for direct comparison.
In this example, gold = text, Bitcoin = image, and the $ represents the shared vector space CLIP learns!

🎯 What is CLIP?

CLIP is a model trained to understand both images and natural language descriptions.

It was introduced by OpenAI in 2021 and differs from traditional image classifiers by connecting images with free-form natural language.

📘 CLIP stands for: Contrastive Language–Image Pre-training

🧠 Key Idea of CLIP

CLIP maps both images and texts into a shared vector space, where matching image-text pairs are close, and unrelated ones are distant.

✨ In Simple Terms:

An Image Encoder to embed images
A Text Encoder to embed text
Both map to the same space, where semantically related inputs are nearby

This is called Contrastive Learning. (We’ll explain more below!)

🔍 Under the Hood

🖼️ Image Encoder (ViT / ResNet Based)

Feature	Description
Architecture	Vision Transformer (ViT), ResNet
ViT Models	ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
ResNet Models	ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64
Input Resolution	224×224 (default), 336×336 for ViT-L/14@336px
ViT Flow	Image → patches → linear embedding + positional encoding
ResNet Flow	Convolution + residual blocks
Feature Output	ViT: [CLS] token, ResNet: global average pooling
Embedding Size	512–1024
Training Time	ViT-L/14: 12 days (256× V100), RN50x64: 18 days (592× V100)

🧠 Text Encoder (Transformer Based)

Feature	Description
Architecture	Transformer (GPT-style)
Layers	12
Hidden Size	512
Attention Heads	8
Parameters	~63M
Tokenizer	Byte Pair Encoding (BPE), vocab size 49,152
Max Sequence Length	76 tokens (+ [SOS], [EOS])
Input Format	[SOS] + tokens + [EOS]
Output Feature	Final layer output at [EOS]
Post-processing	LayerNorm → Linear projection
Masking	Masked self-attention (causal)

🔧 How Was CLIP Trained?

CLIP was trained on 400 million image–text pairs from the internet.
It used:

Batch Size: 32,768
Epochs: 32

🧠 What is Contrastive Pretraining?

Contrastive Pretraining helps a model learn to pull similar pairs together and push dissimilar pairs apart in the embedding space.

Example:
Image: “a photo of a cat” → CLIP → (0, 0, 0, 0.99)
Text: “a photo of a cat” → CLIP → (0, 0, 0, 0.98)
The vectors are close—great match!

📉 InfoNCE Loss

CLIP uses InfoNCE Loss, derived from Noise Contrastive Estimation.

✅ Formula:

L = -log( exp(sim(x, y⁺) / τ) / Σᵢ exp(sim(x, yᵢ) / τ) )

sim(x, y): cosine similarity between image and text vectors
y⁺: correct text pair
yᵢ: all other incorrect (negative) pairs in the batch
τ: temperature parameter (e.g., 0.07)

It encourages the model to maximize the similarity for correct pairs and minimize it for incorrect ones.

🧪 Applications of Contrastive Learning

CLIP: Image–text alignment
SimCLR: Augmented image pairs
ALIGN: Caption–image alignment
DINO, MoCo: Self-supervised learning

🎯 Summary Table

Concept	Description
Goal	Pull positives close, push negatives apart
Loss	InfoNCE
Label-Free	Yes
Applications	Multimodal search, zero-shot tasks, representation learning

💡 Real-World Uses of CLIP

Application	Description
🖼️ Zero-shot Image Classification	Use natural language like `"a photo of a cat"` without fixed labels
🔍 Text-to-Image Search	Search images that best match text queries
🎨 Text-to-Image Generation	Used as a backbone in models like DALL·E
🧪 Multimodal Research	Foundation for vision + language studies

🔍 Compared to Traditional Models

Feature	Traditional Models (e.g. ViT)	CLIP
Input	Image only	Image + Text
Class Definition	Predefined labels	Free-form text
Flexibility	Needs retraining	Zero-shot via prompt change

📈 Why CLIP Is Important

Versatility: One model, many tasks
Zero-shot power: No retraining required
Foundation of Multimodal AI
Text-controlled vision systems

🧠 Limitations of CLIP

Bias: Learned from biased internet data
Sensitive to phrasing: May confuse similar text like "man riding a horse" vs "horse riding a man"
Text overreliance: May depend too heavily on text when not needed
Typographic Attacks: Misleads based on visible text in images

It’s clearly an apple, but “iPod” written on it tricks the model!

🔗 References

✍️ Final Thoughts

CLIP changed the way we approach image and text understanding.
It laid the foundation for models like DALL·E, Stable Diffusion, Flamingo, and more.

👉 If you’re diving into multimodal AI, CLIP is a must-understand model.

Thanks for reading! Feel free to leave questions or suggestions 💬

(한국어) CLIP 모델 이해하기

안녕하세요! 😊
오늘은 OpenAI에서 발표한 강력한 멀티모달 모델, CLIP (Contrastive Language–Image Pre-training)에 대해 알아보려 합니다. 지금은 멀티모델 모델이 너무나도 당연한 것이지만!! 이 연구가 2021년,, chatGPT가 나오기 전의 시점임을 생각하며 이 놀라움에 대하여 탐구해보아요~!

쉽게 CLIP이해하기!!

길동이는 황금 1kg이 있습니다. 그리고 영수는 비트코인이 1개가 있어요! 이 둘이 가지고있는 자산의 가치 어떻게 비교할수 있을까요? CLIP이라는 모델은 이 두 자산(황금, 비트코인)을 동일한 수직선에서 비교할 수 있도록 $로 변환해줍니다!
이때!! 황금=텍스트, 비트코인=이미지 의 예시이고, 동일한 수직선 역할을 하는 $가 CLIP에서는 같은 차원의 결과물 벡터입니다!!

🎯 CLIP이란?

CLIP은 이미지와 텍스트를 동시에 이해할 수 있도록 훈련된 AI 모델입니다.

OpenAI가 2021년에 발표했으며, 기존의 이미지 분류 모델과는 달리 이미지를 자연어 설명과 연결시킬 수 있다는 점에서 큰 주목을 받았습니다.

📘 CLIP은 무슨 약자일까!? = Contrastive Language–Image Pre-training

🧠 CLIP의 핵심 아이디어

CLIP은 이미지와 텍스트를 같은 벡터 공간(embedding space)에 매핑하여,
이미지와 설명이 서로 얼마나 잘 맞는지를 학습합니다.

✨ 간단히 요약하면:

이미지를 인코딩하는 Image Encoder
텍스트를 인코딩하는 Text Encoder
이 둘을 같은 공간으로 매핑하여 서로 잘 맞는 쌍은 가까이, 아닌 쌍은 멀리 떨어지게 학습합니다.

이 과정을 “Contrastive Learning”이라고 합니다. (뒷부분에서 더 자세히 알아봐요!)

✨✨ 조금 더 자세히 설명하자면:

이미지를 인코딩하는 Image Encoder 는, Vision Transformer로, 이 논문에서는 resnet과 ViT를 바탕으로 테스트했고, 최종적으로 CLIP-ViT-B/32 모델이 가장 좋다고 평가했습니다!

🖼️ CLIP Image Encoder (ViT / ResNet 기반)

항목	내용
사용된 아키텍처	Vision Transformer (ViT) 및 ResNet 계열
ViT 모델 종류	ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
ResNet 모델 종류	ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64
입력 해상도	기본 224×224, 일부 모델은 336×336
ViT 구조 특징	이미지 → 패치 분할 → Linear Embedding + Positional Encoding
ResNet 구조 특징	표준 Conv + Residual block 기반 CNN
특성 추출 방식	ViT: [CLS] 토큰 사용 ResNet: 글로벌 평균 풀링
출력 임베딩 차원	512~1024 (모델에 따라 다름)
학습 시간 예시	ViT-L/14: 12일 (256× V100), RN50x64: 18일 (592× V100)

텍스트를 인코딩하는 Text Encoder, 우리가 잘 알고있는 Transformer로 왼쪽에서 오른쪽으로 처리하는 causal LM 방식으로 텍스트를 벡터화하였습니다!

🧠 CLIP Text Encoder (Transformer 기반)

항목	내용
아키텍처	Transformer (GPT-style)
레이어 수	12 layers
Hidden size	512
Attention Heads	8
파라미터 수	약 63M
어휘 사전	Byte Pair Encoding (BPE), vocab size 49,152
입력 길이 제한	76 tokens + [SOS], [EOS]
입력 형식	[SOS] + BPE tokens + [EOS]
특성 추출 방식	[EOS] 위치의 마지막 레이어 출력 사용
후처리	LayerNorm → Linear projection
Attention Mask	Masked Self-Attention (GPT-style)

이 둘을 같은 공간으로 매핑하여 서로 잘 맞는 쌍은 가까이, 아닌 쌍은 멀리 떨어지게 학습합니다.

이 과정을 “Contrastive Learning”이라고 합니다.

🔧 어떻게 학습되었을까?

CLIP은 인터넷에서 수집한 4억 쌍의 이미지–텍스트 데이터로 훈련되었습니다.
32,768개의 batch로 학습되었고, 32 epochs을 진행했다고합니다.

지금까지의 내용이 OpenAI의 프레젠테이션 자료에 잘 요약되어있습니다!

🧠 Contrastive Pretraining에 대하여 더 알아보기!!

Contrastive pretraining은 모델이 비슷한 쌍은 가깝게, 다른 쌍은 멀게 임베딩 공간에서 학습하도록 하는 사전 학습 기법입니다.
주로 이미지–텍스트, 이미지–이미지, 문장–문장처럼 짝을 이루는 데이터 쌍을 기반으로 학습하며,
표현(embedding) 학습에 매우 효과적입니다.

예-이미지 : “고양이 사진” -> CLIP 모델 -> (0,0,0,0.99) 예-텍스트 : “a photo of a cat” -> CLIP 모델 -> (0,0,0,0.98) 위와 같이 이미지와 텍스트의 벡터들이 서로 가까운 벡터가 CLIP 모델을 학습시키기!!

📉 Contrastive Learning에서 사용하는 Loss: InfoNCE Loss

대조 학습에서는 InfoNCE Loss(Noise-Contrastive Estimation 기반 손실)를 주로 사용합니다.

✅ InfoNCE Loss란?

Positive Pair: 서로 관련 있는 쌍 (예: 이미지와 그 설명)
Negative Pairs: 나머지 무관한 모든 쌍

모델은 positive pair의 유사도(similarity)는 높이고,
negative pair의 유사도는 낮추는 방향으로 학습됩니다.

📐 수식 개요

 L = -log( exp(sim(x, y⁺) / τ) / Σᵢ exp(sim(x, yᵢ) / τ) )

sim(x, y): 이미지, 텍스트 벡터의 코사인유사도!!
y⁺: 올바른 텍스트 쌍
yᵢ: 같은 배치 내 다른 텍스트들 (negative)
τ: temperature scaling factor (보통 0.07)

InfoNCE는 결국 정답 쌍이 전체 중 얼마나 상대적으로 잘 맞는지를 확률처럼 모델링하여 학습하는 방식입니다.

🔧 실제 활용 예시

CLIP: 이미지–텍스트 임베딩 정렬
SimCLR: 이미지–이미지 Augmented Pair
ALIGN: 이미지–캡션 매칭
DINO, MoCo: 자가 지도 학습 기반 임베딩 학습

🎯 요약

개념	설명
목적	Positive는 가깝게, Negative는 멀게
손실 함수	InfoNCE Loss
특징	라벨 없이도 유사성 기반 학습 가능
응용	멀티모달, 검색, Zero-shot 등 다양한 태스크

💡 CLIP의 활용 예시

CLIP은 단순한 이미지 분류를 넘어 다양한 방식으로 활용 가능합니다:

활용	설명
🖼️ Zero-shot 이미지 분류	사전 정의된 클래스 없이 텍스트만으로 분류 가능 (`"a photo of a dog"`, `"a photo of a cat"`)
🔍 텍스트 기반 이미지 검색	`"a person riding a horse"`와 가장 잘 맞는 이미지를 검색
🎨 텍스트 → 이미지 생성 보조	DALL·E 같은 생성 모델의 보조 역할
🧪 멀티모달 연구 기반	이미지와 텍스트를 함께 다루는 연구의 출발점으로 자주 쓰임

예시!: 이미지 스타일바꾸기

최근엔 익숙해진 이미지 스타일 바꾸기!! 도결국 이 CLIP에서 시작됬다고 볼수 있습니다~! —

🔍 기존 모델과의 차이점

항목	기존 이미지 분류 모델 (ViT, Resnet)	CLIP
입력	이미지만 사용	이미지 + 텍스트
분류 기준	고정된 라벨(class)	자유로운 자연어
확장성	클래스 추가 시 재학습 필요	문장만 바꾸면 Zero-shot 적용 가능

📈 CLIP이 중요한 이유

범용성: 한 번 훈련된 모델로 다양한 태스크에 적용 가능
Zero-shot 성능: 새로운 클래스를 재학습 없이 처리 가능
멀티모달 AI의 시작점: 텍스트와 이미지의 공동 표현 공간을 다루는 기반이 됨
텍스트 기반 제어: 사용자가 원하는 이미지를 텍스트로 제시할 수 있게 함

🧠 CLIP의 한계점은?

Bias 문제: 웹 데이터 기반 학습이라, 인간의 편향이 모델에도 반영될 수 있음
세밀한 문장 구분은 어려움: "a man riding a horse"와 "a horse riding a man"을 명확히 구분하지 못할 수 있음
텍스트에 지나치게 의존: 시각 정보만으로는 처리 가능한 경우에도 텍스트에 의존할 수 있음
Typographic Attack : 이미지에 텍스트가 들어가있으면 잘못인식함!!
사과이지만 iPod라는 텍스트가 써있으니 iPod로 인식하는 한계가 발견되었습니다!!

🔗 참고 자료 및 코드

✍️ 마무리하며

CLIP은 단순한 이미지 분류를 넘어서, 이미지와 텍스트를 연결하는 사고방식 자체를 바꾼 모델입니다.
오늘날 DALL·E, Stable Diffusion, Flamingo 등 다양한 멀티모달 모델의 기반이 되었죠.

👉 앞으로 멀티모달 AI에 관심이 있다면, CLIP은 꼭 이해하고 넘어가야 할 핵심 모델입니다!

감사합니다! 궁금한 점이나 더 알고 싶은 주제가 있다면 댓글로 남겨주세요 💬

AI, Research

This post is licensed under CC BY 4.0 by the author.