
๐Ÿ“ ViT, you can do greater things! - The emergence of DINO!! // ViT, ๋„ˆ๋Š” ๋” ํฐ์ผ์„ ํ• ์ˆ˜์žˆ์–ด! - DINO์˜ ๋“ฑ์žฅ!! (ICCV 2021)

๐Ÿ“ ViT, you can do greater things! - The emergence of DINO!! // ViT, ๋„ˆ๋Š” ๋” ํฐ์ผ์„ ํ• ์ˆ˜์žˆ์–ด! - DINO์˜ ๋“ฑ์žฅ!! (ICCV 2021)

๐Ÿง  What is DINO?

Study of โ€œEmerging Properties in Self-Supervised Vision Transformersโ€ (ICCV, 2021)


๐Ÿ“– Paper Title: Emerging Properties in Self-Supervised Vision Transformers
โœ๏ธ Authors: Facebook AI Research (Caron, Mathilde, et al.)
๐ŸŒŸ One-line Summary: A core model where ViT learns by creating its own teacher and student models without labels!


๐Ÿ“š Core Idea


  • DINO stands for Distillation with No Labels!
  • It enables a student network to mimic the teacher networkโ€™s output
    without any external labels, in a self-supervised learning manner!
  • DINO acts more as an encoder that transforms images into vectors, rather than merely performing image classification.
  • The Teacher and Student models are trained together, but ultimately the Student model is the one that gets used.
  • The Teacher observes the full view of the image,
  • while the Student sees cropped or transformed views and learns to extract the same information!
  • The Teacher model is updated slowly, in step with the Student's improvements!

๐Ÿ” Background of DINO

  • Limitations of Supervised Learning:
    • With the rise of ViT, image classification advanced significantly.
    • However, Vision models still heavily relied on large, labeled datasets like ImageNet.
    • Labeling is costly and error-prone, especially in certain domains.
  • Limitations of Previous Self-Supervised Models:
    • Prior to DINO, most image SSL methods were CNN-based.
    • CNNs mainly capture local information, while ViT can capture global context.
  • Therefore!! It was the perfect moment for a model that could learn from images in a self-supervised way using ViT!

๐Ÿ” How DINO Works

teacher_student

1. Two Networks: Student and Teacher

  • Both networks consist of a ViT Encoder + MLP Projection Head.
  • The ViT backbone starts with randomly initialized weights!
  • Note: ResNet can replace ViT as the backbone too!
[Input Image]
    โ†“
Patchify (for ViT)
    โ†“
ViT Backbone (or ResNet)
    โ†“
[CLS] Token Output
    โ†“
MLP Projection Head
    โ†“
Output Vector (for contrastive-like loss)
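
To make this structure concrete, here is a minimal PyTorch sketch of one DINO network (backbone + MLP projection head). It is a simplified illustration, not the paper's exact code: the `timm` library, the `vit_small_patch16_224` backbone choice, and the head dimensions are assumptions for this example.

```python
# Minimal sketch of one DINO network: ViT backbone + MLP projection head.
# Assumes the `timm` library for the backbone (an assumption, not the paper's code).
import torch
import torch.nn as nn
import timm

class DINOHead(nn.Module):
    """Simplified 3-layer MLP projection head mapping features to K output dimensions."""
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        self.last_layer = nn.Linear(bottleneck_dim, out_dim, bias=False)

    def forward(self, x):
        x = nn.functional.normalize(self.mlp(x), dim=-1)  # L2-normalized bottleneck
        return self.last_layer(x)

class DINOModel(nn.Module):
    """Randomly initialized ViT encoder followed by the projection head."""
    def __init__(self):
        super().__init__()
        # num_classes=0 makes timm return the pooled [CLS] feature instead of logits
        self.backbone = timm.create_model('vit_small_patch16_224',
                                          pretrained=False, num_classes=0)
        self.head = DINOHead(self.backbone.num_features)

    def forward(self, x):
        return self.head(self.backbone(x))

student, teacher = DINOModel(), DINOModel()
teacher.load_state_dict(student.state_dict())  # start from identical weights
for p in teacher.parameters():
    p.requires_grad = False                    # the teacher receives no gradients
```

Starting both networks from the same random weights and detaching the Teacher from the optimizer mirrors the setup described above.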

2. View Generation: Applying various augmentations to create multiple training views from the same image

  • Create 2 Global Views and
  • 6 Local Views
  • Thus, a total of 8 different augmented views are generated from a single image.

  • ๐ŸŒ Global Views
ItemDescription
Quantity2
Size224 ร— 224 (same as ViT input size)
UsageBoth Student and Teacher use them (Teacher uses only global views)
AugmentationsRandom resized crop, color jittering, blur, and other strong augmentations
  • ๐Ÿ”Ž Local Views
ItemDescription
QuantityTypically 6 (varies depending on the experiment)
SizeSmall crops like 96 ร— 96
UsageOnly used by the Student
AugmentationsRandom crop, color jitter, grayscale conversion, blur, etc.
PurposeTrain the model to infer the full concept even from partial views
  • โœจ Main Augmentation Techniques
AugmentationDescription
Random Resized CroppingRandomly crop and resize images to allow seeing diverse parts of the image.
Color Jittering (brightness, contrast, saturation, hue)Randomly change brightness, contrast, saturation, and hue to vary color characteristics.
Random GrayscaleRandomly convert images to grayscale, helping the model rely less on color.
Gaussian BlurApply blur to reduce sharpness, making the model robust to low-clarity images.
SolarizationInvert the bright regions of an image, adding variance.
Horizontal FlipFlip the image horizontally.
NormalizationNormalize pixel values to mean 0 and standard deviation 1 to stabilize training.
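
Below is a minimal sketch of this multi-crop view generation using torchvision transforms. The crop scales, blur kernel, and augmentation probabilities are illustrative values, not necessarily the paper's exact settings.

```python
# Sketch of multi-crop view generation (2 global + 6 local views per image).
# Parameter values here are illustrative assumptions.
from torchvision import transforms

flip_and_jitter = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
])
normalize = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

global_view = transforms.Compose([            # 224x224, strong augmentations
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    flip_and_jitter,
    transforms.GaussianBlur(kernel_size=23),
    transforms.RandomSolarize(threshold=128, p=0.2),
    normalize,
])
local_view = transforms.Compose([             # small 96x96 crops, Student only
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    flip_and_jitter,
    transforms.GaussianBlur(kernel_size=23),
    normalize,
])

def make_views(img, n_global=2, n_local=6):
    """Return 2 global + 6 local augmented views of one PIL image."""
    return [global_view(img) for _ in range(n_global)] + \
           [local_view(img) for _ in range(n_local)]
```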

3. Prediction Alignment: The Student learns to match the Teacherโ€™s output distribution from different views

Even though the Student and Teacher see different versions of the image,
they should produce the same output vector!
This is the core Self-Distillation approach of DINO.

  • ๐Ÿ” Core Idea
    • In self-supervised learning, no ground truth labels are available.
    • Different augmented views of the same image are created and fed into the Student and Teacher separately.
    • The Teacherโ€™s output vector is treated as a โ€œpseudo-labelโ€.
    • The Student is trained to match the Teacher's output as closely as possible.
  • ๐Ÿง  Process Flow

    1. Create two different views of the same image.
    2. View 1 โ†’ input to the Teacher โ†’ generates a fixed output vector.
    3. View 2 โ†’ input to the Student โ†’ generates a trainable output vector.
    4. Minimize the cross-entropy loss between the Student's and the Teacher's outputs (see the sketch below):
    Teacher output: t = softmax(h_T(xโ‚) / ฯ„_T)
    Student output: s = softmax(h_S(xโ‚‚) / ฯ„_S)
    Loss = cross_entropy(t, s)
    • ฯ„_T, ฯ„_S are temperatures: the lower they are, the sharper the softmax distribution, so its peaks influence the weight updates more strongly.
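
Here is a minimal PyTorch sketch of this loss, following the formulas above but simplified (the full method additionally centers the teacher output to help avoid collapse). The temperature values shown are commonly used defaults, assumed for illustration.

```python
# Minimal sketch of DINO's self-distillation loss (simplified).
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened teacher and student distributions."""
    t = F.softmax(teacher_out / tau_t, dim=-1).detach()  # pseudo-label: no gradient
    log_s = F.log_softmax(student_out / tau_s, dim=-1)   # trainable student branch
    return -(t * log_s).sum(dim=-1).mean()

# Toy usage with random vectors standing in for projection-head outputs
student_out = torch.randn(8, 65536, requires_grad=True)  # batch of 8, K = 65536
teacher_out = torch.randn(8, 65536)
loss = dino_loss(student_out, teacher_out)
loss.backward()  # gradients flow only into the student branch
```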
    
4. Teacher Update: Not only the Student learns, the Teacher does too, only slowly!!

If the Teacher raced ahead too quickly, the Student would get lost, right!?
So, unlike ordinary supervised learning, DINO's Teacher is not trained directly;
it is updated slowly from the Student network, via an Exponential Moving Average (EMA) of the Student's parameters:

    ฮธ_T โ† m ร— ฮธ_T + (1 - m) ร— ฮธ_S

  • ฮธ_T : the Teacher's parameters
  • ฮธ_S : the Student's parameters
  • m : momentum coefficient (typically between 0.996 and 0.999)

  • While the Student learns quickly via backpropagation,
  • the Teacher and Student have identical architectures,
  • so each of the Teacher's weights is updated following the formula above! (A code sketch follows below.)
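
The EMA update itself is only a few lines; a sketch assuming the `student` and `teacher` modules defined earlier, with identical architectures:

```python
# Sketch of the EMA teacher update from the formula above.
import torch

@torch.no_grad()
def update_teacher(teacher, student, m=0.996):
    """theta_T <- m * theta_T + (1 - m) * theta_S, applied weight by weight."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)
```

This would be called once per training step, right after the Student's optimizer step (in the paper, m is also ramped toward 1 over the course of training).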

๐Ÿš€ Reconfirming the Importance of DINO

  • No labels required: Capable of learning high-quality features without manual annotation.
  • Versatility: While traditional ViTs were limited to classification,
    DINO expands the use to image segmentation, image clustering, image retrieval, and more!
  • Potential of ViT: Thanks to DINO, ViT shows much greater potential as a general-purpose encoder!

๐Ÿ“ˆ Key Achievements of DINO


  • DINO demonstrated strong classification performance on ImageNet!
    • But wait, DINO is an encoder! How does it classify?
    • It uses the Linear Probing approach! (explained in detail below!)
  • Additionally, DINO showed outstanding strength in clustering, retrieval, detection, and many other tasks!
  • Through ablation studies, the contribution of each component was carefully verified! (details below!)

โœ… Linear Probing: Evaluating how linearly separable DINOโ€™s features are

  1. Freeze the ViT encoder trained by DINO.
  2. Add a simple Linear Classifier on top.
  3. Train only the Linear layer using ImageNet labels.
  4. Measure ImageNet Top-1 Accuracy to evaluate the quality of extracted features.
  • DINOโ€™s features can almost segment major regions in images without supervision.
  • DINO works effectively across various architectures (ResNet, ViT, hybrid).
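
A minimal sketch of steps 1 to 3, with hypothetical names (`encoder`, `feat_dim`) standing in for the DINO-trained backbone; the point is that only the linear layer receives gradients.

```python
# Linear-probing sketch: freeze the DINO encoder, train only a linear classifier.
import torch
import torch.nn as nn

def build_linear_probe(encoder, feat_dim, num_classes=1000):
    for p in encoder.parameters():
        p.requires_grad = False              # 1. freeze the DINO-trained encoder
    encoder.eval()
    return nn.Linear(feat_dim, num_classes)  # 2. simple linear classifier on top

def probe_step(encoder, probe, optimizer, images, labels):
    """3. Train only the linear layer using labeled data (e.g., ImageNet)."""
    with torch.no_grad():
        feats = encoder(images)              # frozen features
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()                          # updates reach only the probe
    optimizer.step()
    return loss.item()
```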

๐Ÿงช Ablation Study: Analyzing the Impact of Each Component

  • In the Ablation Study, the following components were added or removed to analyze their effect on model performance:
  • Mom (Momentum Encoder): updates the Teacher's parameters via an EMA of the Student. Without it, the Teacher doesn't learn, the model collapses, and performance drops severely. A critical component.
  • SK (Sinkhorn Normalization): normalizes the output distribution evenly. Helps prevent collapse only when there is no momentum encoder; unnecessary if momentum is present.
  • MC (Multi-Crop): creates multiple views of different scales from one image. Significantly improves feature quality; highly important.
  • CE (Cross-Entropy Loss): aligns the Student's and Teacher's output distributions. The core loss function in DINO; removing it degrades performance.
  • Pred (Predictor): a small MLP added on the Student's output. Has minimal effect in DINO, though it was critical in BYOL.

โœจ Final Thoughts

In todayโ€™s multimodal AI era,
transforming images into meaningful vector representations has become very natural.

โœ… DINO is a remarkable study that opened up the limitless potential of ViT,
โœ… enabling powerful learning without requiring labels!

Recently, with the renewed rise of knowledge distillation techniques (e.g., DeepSeek),
the idea of training two models together in a self-supervised manner is attracting attention again.

Interestingly, this line of research traces back to BYOL (Bootstrap Your Own Latent),
a pioneering study originally proposed by DeepMind.

Next, I definitely plan to dive into studying BYOL as well! ๐Ÿ˜Š



This post is licensed under CC BY 4.0 by the author.