
๐Ÿ“ Understanding CLIP - CLIP ๋ชจ๋ธ ์ดํ•ดํ•˜๊ธฐ

๐Ÿ“ Understanding CLIP - CLIP ๋ชจ๋ธ ์ดํ•ดํ•˜๊ธฐ

(English-ver) Understanding CLIP - A Beginner-Friendly Guide to Contrastive Language–Image Pre-training

Hello! 😊
Today, let's dive into CLIP (Contrastive Language–Image Pre-training), a powerful multimodal model released by OpenAI.
While multimodal models are now the norm, it's remarkable that this research came out in 2021, before ChatGPT! Let's explore what makes CLIP so groundbreaking.


An Easy Way to Understand CLIP

clip_manhwa_en

John has 1 kg of gold.
Alex has 1 Bitcoin.
How can we compare their value fairly?
A model like CLIP helps transform these different assets into the same scale, like dollars ($), for direct comparison.

In this example, gold = text, Bitcoin = image, and the $ represents the shared vector space CLIP learns!


🎯 What is CLIP?

CLIP is a model trained to understand both images and natural language descriptions.

CLIP_paper

It was introduced by OpenAI in 2021 and differs from traditional image classifiers by connecting images with free-form natural language.

📘 CLIP stands for: Contrastive Language–Image Pre-training


🧠 Key Idea of CLIP

CLIP maps both images and texts into a shared vector space, where matching image-text pairs are close, and unrelated ones are distant.

✨ In Simple Terms:

  • An Image Encoder to embed images
  • A Text Encoder to embed text
  • Both map to the same space, where semantically related inputs are nearby

This is called Contrastive Learning. (Weโ€™ll explain more below!)
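
To make the shared-space idea concrete, here is a minimal PyTorch-style sketch with toy stand-in encoders (the module names, sizes, and random inputs are purely illustrative, not the real CLIP architecture):

```python
# Toy sketch of the dual-encoder idea: two encoders projecting into one shared space
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):            # stand-in for ViT / ResNet
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, dim)
    def forward(self, images):               # images: (B, 3, 224, 224)
        return self.proj(images.flatten(1))

class ToyTextEncoder(nn.Module):             # stand-in for the Transformer text encoder
    def __init__(self, vocab_size=49152, dim=512):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)
    def forward(self, tokens):               # tokens: (B, seq_len)
        return self.emb(tokens)

image_encoder, text_encoder = ToyImageEncoder(), ToyTextEncoder()
images = torch.randn(4, 3, 224, 224)
tokens = torch.randint(0, 49152, (4, 77))

# Both encoders land in the same 512-d space; cosine similarity compares them directly
image_z = F.normalize(image_encoder(images), dim=-1)
text_z = F.normalize(text_encoder(tokens), dim=-1)
similarity = image_z @ text_z.t()            # (4, 4): matching pairs should score highest
print(similarity.shape)
```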


๐Ÿ” Under the Hood

🖼️ Image Encoder (ViT / ResNet Based)

| Feature | Description |
| --- | --- |
| Architecture | Vision Transformer (ViT), ResNet |
| ViT models | ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px |
| ResNet models | ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64 |
| Input resolution | 224×224 (default), 336×336 for ViT-L/14@336px |
| ViT flow | Image → patches → linear embedding + positional encoding |
| ResNet flow | Convolution + residual blocks |
| Feature output | ViT: [CLS] token; ResNet: attention pooling (in place of global average pooling) |
| Embedding size | 512–1024 (varies by model) |
| Training time | ViT-L/14: 12 days (256× V100), RN50x64: 18 days (592× V100) |

🧠 Text Encoder (Transformer Based)

| Feature | Description |
| --- | --- |
| Architecture | Transformer (GPT-style) |
| Layers | 12 |
| Hidden size | 512 |
| Attention heads | 8 |
| Parameters | ~63M |
| Tokenizer | Byte Pair Encoding (BPE), vocab size 49,152 |
| Max sequence length | 76 tokens (+ [SOS], [EOS]) |
| Input format | [SOS] + tokens + [EOS] |
| Output feature | Final-layer output at [EOS] |
| Post-processing | LayerNorm → linear projection |
| Masking | Masked (causal) self-attention |
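
As a practical way to poke at these two encoders, here is a hedged sketch that assumes the Hugging Face transformers implementation of CLIP and the openai/clip-vit-base-patch32 checkpoint (cat.jpg is a placeholder path):

```python
# Getting image and text embeddings from a pretrained CLIP checkpoint
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                               # placeholder image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.image_embeds.shape)                # (1, 512): projected image embedding
print(outputs.text_embeds.shape)                 # (2, 512): projected text embeddings
print(outputs.logits_per_image.softmax(dim=-1))  # probabilities over the two captions
```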

🔧 How Was CLIP Trained?

CLIP was trained on 400 million imageโ€“text pairs from the internet.
It used:

  • Batch Size: 32,768
  • Epochs: 32

presentation


🧠 What is Contrastive Pretraining?

contrastive_pretraining

Contrastive Pretraining helps a model learn to pull similar pairs together and push dissimilar pairs apart in the embedding space.

Example:
Image: a photo of a cat → CLIP → (0, 0, 0, 0.99)
Text: "a photo of a cat" → CLIP → (0, 0, 0, 0.98)
The vectors are close, so the pair is a great match!
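
For those toy vectors, the cosine similarity is essentially 1.0, which is exactly what "close" means here; a quick NumPy check:

```python
# Cosine similarity of the toy vectors from the example above
import numpy as np

img_vec = np.array([0.0, 0.0, 0.0, 0.99])   # image embedding of the cat photo
txt_vec = np.array([0.0, 0.0, 0.0, 0.98])   # text embedding of "a photo of a cat"

cos_sim = img_vec @ txt_vec / (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec))
print(cos_sim)   # ~1.0: the vectors point in the same direction, i.e. a strong match
```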


📉 InfoNCE Loss

CLIP uses InfoNCE Loss, derived from Noise Contrastive Estimation.

✅ Formula:

L = -log( exp(sim(x, y⁺) / τ) / Σᵢ exp(sim(x, yᵢ) / τ) )
  • sim(x, y): cosine similarity between image and text vectors
  • y⁺: the correct (matching) text pair
  • yᵢ: all other incorrect (negative) pairs in the batch
  • τ: temperature parameter (e.g., 0.07)

It encourages the model to maximize the similarity for correct pairs and minimize it for incorrect ones.
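
Here is a minimal sketch of that loss in PyTorch, assuming a batch where the i-th image matches the i-th text. CLIP uses the symmetric (image→text plus text→image) version and actually learns the temperature as a parameter; the sketch fixes τ at 0.07 for simplicity:

```python
# Symmetric InfoNCE loss over a batch of N matching image-text pairs
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity sim(x, y)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))            # diagonal entries are the positives

    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2

# Random 512-d embeddings for a batch of 8 pairs, just to show the call
print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```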


🧪 Applications of Contrastive Learning

  • CLIP: Image–text alignment
  • SimCLR: Augmented image pairs
  • ALIGN: Caption–image alignment
  • DINO, MoCo: Self-supervised learning

🎯 Summary Table

| Concept | Description |
| --- | --- |
| Goal | Pull positives close, push negatives apart |
| Loss | InfoNCE |
| Label-free | Yes |
| Applications | Multimodal search, zero-shot tasks, representation learning |

💡 Real-World Uses of CLIP

| Application | Description |
| --- | --- |
| 🖼️ Zero-shot image classification | Classify with natural language like "a photo of a cat", without a fixed label set (see the sketch below) |
| 🔍 Text-to-image search | Retrieve the images that best match a text query |
| 🎨 Text-to-image generation | Used as a component in models like DALL·E |
| 🧪 Multimodal research | Foundation for vision + language studies |
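
To make the zero-shot row above concrete, here is a hedged sketch using the openai/CLIP reference package (installing it from the openai/CLIP GitHub repository is assumed, and photo.jpg is a placeholder path):

```python
# Zero-shot image classification with natural-language labels
import torch
import clip                      # the openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["cat", "dog", "horse"]
prompts = [f"a photo of a {label}" for label in labels]   # prompt template from the paper

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # temperature-scaled cosine similarities
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))    # highest probability should be the true class
```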

style_change


๐Ÿ” Compared to Traditional Models

| Feature | Traditional models (e.g., ViT) | CLIP |
| --- | --- | --- |
| Input | Image only | Image + text |
| Class definition | Predefined labels | Free-form text |
| Flexibility | Needs retraining for new classes | Zero-shot via prompt change |

📈 Why CLIP Is Important

  1. Versatility: One model, many tasks
  2. Zero-shot power: No retraining required
  3. Foundation of Multimodal AI
  4. Text-controlled vision systems

🧠 Limitations of CLIP

  • Bias: Learned from biased internet data
  • Sensitive to phrasing: May confuse similar text like "man riding a horse" vs "horse riding a man"
  • Text overreliance: May depend too heavily on text when not needed
  • Typographic Attacks: Misleads based on visible text in images

typo_attack

It's clearly an apple, but the word "iPod" written on it tricks the model!


🔗 References


✍️ Final Thoughts

CLIP changed the way we approach image and text understanding.
It laid the foundation for models like DALL·E, Stable Diffusion, Flamingo, and more.

👉 If you're diving into multimodal AI, CLIP is a must-understand model.

Thanks for reading! Feel free to leave questions or suggestions 💬


(Korean-ver) Understanding the CLIP Model

Hello! 😊
Today we'll look at CLIP (Contrastive Language–Image Pre-training), a powerful multimodal model released by OpenAI. Multimodal models feel like a given these days, but this research came out in 2021, before ChatGPT, which makes it all the more remarkable. Let's dig in!


An Easy Way to Understand CLIP!!

clip_manhwa_kr

๊ธธ๋™์ด๋Š” ํ™ฉ๊ธˆ 1kg์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์˜์ˆ˜๋Š” ๋น„ํŠธ์ฝ”์ธ์ด 1๊ฐœ๊ฐ€ ์žˆ์–ด์š”! ์ด ๋‘˜์ด ๊ฐ€์ง€๊ณ ์žˆ๋Š” ์ž์‚ฐ์˜ ๊ฐ€์น˜ ์–ด๋–ป๊ฒŒ ๋น„๊ตํ• ์ˆ˜ ์žˆ์„๊นŒ์š”? CLIP์ด๋ผ๋Š” ๋ชจ๋ธ์€ ์ด ๋‘ ์ž์‚ฐ(ํ™ฉ๊ธˆ, ๋น„ํŠธ์ฝ”์ธ)์„ ๋™์ผํ•œ ์ˆ˜์ง์„ ์—์„œ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋„๋ก $๋กœ ๋ณ€ํ™˜ํ•ด์ค๋‹ˆ๋‹ค!

์ด๋•Œ!! ํ™ฉ๊ธˆ=ํ…์ŠคํŠธ, ๋น„ํŠธ์ฝ”์ธ=์ด๋ฏธ์ง€ ์˜ ์˜ˆ์‹œ์ด๊ณ , ๋™์ผํ•œ ์ˆ˜์ง์„  ์—ญํ• ์„ ํ•˜๋Š” $๊ฐ€ CLIP์—์„œ๋Š” ๊ฐ™์€ ์ฐจ์›์˜ ๊ฒฐ๊ณผ๋ฌผ ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค!!

🎯 What is CLIP?

CLIP is an AI model trained to understand images and text together.

CLIP_paper Released by OpenAI in 2021, it drew a lot of attention because, unlike conventional image classifiers, it connects images with free-form natural-language descriptions.

📘 What does CLIP stand for? = Contrastive Language–Image Pre-training

🧠 Key Idea of CLIP

CLIP maps images and text into the same vector (embedding) space,
learning how well an image and its description match each other.

✨ In short:

  • An Image Encoder that encodes images
  • A Text Encoder that encodes text
  • Both are mapped into the same space, trained so that matching pairs end up close together and mismatched pairs end up far apart.

This process is called Contrastive Learning. (We'll look at it in more detail below!)

✨✨ In a bit more detail:

  • The Image Encoder that embeds images: the paper tests both ResNet and Vision Transformer (ViT) backbones, and the ViT variants came out ahead, with ViT-L/14@336px reported as the best-performing model (the compact ViT-B/32 is the checkpoint most often seen in examples).

🖼️ CLIP Image Encoder (ViT / ResNet based)

| Item | Description |
| --- | --- |
| Architecture | Vision Transformer (ViT) and ResNet family |
| ViT models | ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px |
| ResNet models | ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64 |
| Input resolution | 224×224 by default, 336×336 for some models |
| ViT structure | Image → patch split → linear embedding + positional encoding |
| ResNet structure | Standard convolution + residual-block CNN |
| Feature extraction | ViT: [CLS] token; ResNet: attention pooling (in place of global average pooling) |
| Output embedding dim | 512–1024 (varies by model) |
| Example training time | ViT-L/14: 12 days (256× V100), RN50x64: 18 days (592× V100) |
  • The Text Encoder that embeds text is the familiar Transformer; it vectorizes text in a causal-LM fashion, processing tokens from left to right.

🧠 CLIP Text Encoder (Transformer based)

| Item | Description |
| --- | --- |
| Architecture | Transformer (GPT-style) |
| Layers | 12 |
| Hidden size | 512 |
| Attention heads | 8 |
| Parameters | ~63M |
| Vocabulary | Byte Pair Encoding (BPE), vocab size 49,152 |
| Input length limit | 76 tokens (+ [SOS], [EOS]) |
| Input format | [SOS] + BPE tokens + [EOS] |
| Feature extraction | Final-layer output at the [EOS] position |
| Post-processing | LayerNorm → linear projection |
| Attention mask | Masked self-attention (GPT-style) |
  • Both encoders map into the same space, trained so that matching pairs end up close together and mismatched pairs end up far apart.

This process is called Contrastive Learning.
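
To see the [SOS]/[EOS] framing and the 77-token context window from the table above in practice, here is a small sketch; it assumes the Hugging Face CLIPTokenizer for the openai/clip-vit-base-patch32 checkpoint rather than the tokenizer code from the original repository:

```python
# Inspecting CLIP's BPE tokenizer (Hugging Face CLIPTokenizer assumed)
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

ids = tok("a photo of a cat")["input_ids"]
print(tok.convert_ids_to_tokens(ids))  # <|startoftext|> a photo of a cat <|endoftext|>
print(tok.model_max_length)            # 77 = 76 content tokens plus the special tokens
```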


🔧 How Was It Trained?

CLIP was trained on 400 million image–text pairs collected from the internet.
Training used a batch size of 32,768 and ran for 32 epochs.


Everything covered so far is nicely summarized in OpenAI's presentation material below!

presentation

🧠 A Closer Look at Contrastive Pretraining!!

contrastive_pretraining Contrastive pretraining is a pretraining technique that teaches a model to place similar pairs close together and dissimilar pairs far apart in the embedding space.
It is usually trained on paired data such as image–text, image–image, or sentence–sentence pairs,
and it is very effective for representation (embedding) learning.

Example (image): a photo of a cat → CLIP → (0, 0, 0, 0.99). Example (text): "a photo of a cat" → CLIP → (0, 0, 0, 0.98). Training CLIP means pulling matching image and text vectors close together like this!


📉 The Loss Used in Contrastive Learning: InfoNCE Loss

Contrastive learning typically uses the InfoNCE loss (a loss based on Noise-Contrastive Estimation).

✅ What is the InfoNCE loss?

  • Positive pair: a related pair (e.g., an image and its caption)
  • Negative pairs: all the remaining, unrelated pairs

The model is trained to raise the similarity of the positive pair
and lower the similarity of the negative pairs.

📐 Formula Overview

 L = -log( exp(sim(x, y⁺) / τ) / Σᵢ exp(sim(x, yᵢ) / τ) )
  • sim(x, y): cosine similarity between the image and text vectors
  • y⁺: the correct (matching) text
  • yᵢ: the other texts in the same batch (negatives)
  • τ: temperature scaling factor (typically 0.07)

In the end, InfoNCE trains the model by treating "how well does the correct pair match relative to all candidates" as a probability.
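
A tiny numeric illustration (with made-up similarity values) of what the temperature τ does: dividing by a small τ before the softmax makes the distribution sharply peaked on the best-matching pair.

```python
# Effect of the temperature on the softmax over similarities (toy numbers)
import torch

sims = torch.tensor([0.9, 0.5, 0.1])  # cosine similarities: 1 positive, 2 negatives
for tau in (1.0, 0.07):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {[round(p, 3) for p in probs.tolist()]}")
# tau=1.0  -> roughly [0.47, 0.32, 0.21]   (flat)
# tau=0.07 -> roughly [0.997, 0.003, 0.0]  (peaked on the positive pair)
```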

🔧 Where It Is Used in Practice

  • CLIP: image–text embedding alignment
  • SimCLR: augmented image–image pairs
  • ALIGN: image–caption matching
  • DINO, MoCo: self-supervised embedding learning

🎯 Summary

| Concept | Description |
| --- | --- |
| Goal | Pull positives close, push negatives apart |
| Loss | InfoNCE loss |
| Key property | Learns from similarity alone, without labels |
| Applications | Multimodal tasks, retrieval, zero-shot, and more |

💡 Applications of CLIP

CLIP can be used in many ways beyond simple image classification:

| Application | Description |
| --- | --- |
| 🖼️ Zero-shot image classification | Classify with text alone, without predefined classes ("a photo of a dog", "a photo of a cat") |
| 🔍 Text-based image search | Retrieve the images that best match "a person riding a horse" (see the sketch below) |
| 🎨 Text-to-image generation support | Supporting component in generative models like DALL·E |
| 🧪 Multimodal research foundation | A frequent starting point for research that handles images and text together |
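
As noted in the search row above, here is a hedged text-to-image retrieval sketch. It assumes the Hugging Face transformers CLIP implementation, and the gallery file names are placeholders:

```python
# Text-to-image search: embed a small image gallery once, then rank it against a text query
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

gallery = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder file names
images = [Image.open(p) for p in gallery]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=["a person riding a horse"], return_tensors="pt", padding=True)
    )

# Cosine similarity = dot product of L2-normalized embeddings
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)

print(gallery[scores.argmax().item()], scores.tolist())
```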

Example: Changing an Image's Style!

The image style transfer everyone has gotten used to lately can, in the end, be traced back to CLIP! style_change

๐Ÿ” ๊ธฐ์กด ๋ชจ๋ธ๊ณผ์˜ ์ฐจ์ด์ 

ํ•ญ๋ชฉ๊ธฐ์กด ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ๋ชจ๋ธ (ViT, Resnet)CLIP
์ž…๋ ฅ์ด๋ฏธ์ง€๋งŒ ์‚ฌ์šฉ์ด๋ฏธ์ง€ + ํ…์ŠคํŠธ
๋ถ„๋ฅ˜ ๊ธฐ์ค€๊ณ ์ •๋œ ๋ผ๋ฒจ(class)์ž์œ ๋กœ์šด ์ž์—ฐ์–ด
ํ™•์žฅ์„ฑํด๋ž˜์Šค ์ถ”๊ฐ€ ์‹œ ์žฌํ•™์Šต ํ•„์š”๋ฌธ์žฅ๋งŒ ๋ฐ”๊พธ๋ฉด Zero-shot ์ ์šฉ ๊ฐ€๋Šฅ

📈 Why CLIP Matters

  1. Versatility: one trained model can be applied to many different tasks
  2. Zero-shot performance: handles new classes without retraining
  3. A starting point for multimodal AI: the basis for joint text-image representation spaces
  4. Text-based control: lets users describe the images they want in plain text

🧠 Limitations of CLIP

  • Bias: trained on web data, so human biases can be reflected in the model
  • Fine-grained sentence distinctions are hard: it may fail to clearly distinguish "a man riding a horse" from "a horse riding a man"
  • Over-reliance on text: it may lean on text even in cases the visual information alone could handle
  • Typographic attacks: text rendered inside an image can cause misrecognition! typo_attack

    It is clearly an apple, but because "iPod" is written on it, the model recognizes it as an iPod, a clear limitation!


🔗 References and Code


✍️ Final Thoughts

CLIP goes beyond simple image classification: it changed the very way we think about connecting images and text.
Today it underpins multimodal models such as DALL·E, Stable Diffusion, and Flamingo.

👉 If you're interested in multimodal AI, CLIP is a core model you really need to understand!


๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค! ๊ถ๊ธˆํ•œ ์ ์ด๋‚˜ ๋” ์•Œ๊ณ  ์‹ถ์€ ์ฃผ์ œ๊ฐ€ ์žˆ๋‹ค๋ฉด ๋Œ“๊ธ€๋กœ ๋‚จ๊ฒจ์ฃผ์„ธ์š” ๐Ÿ’ฌ

This post is licensed under CC BY 4.0 by the author.