🦕 ViT, you can do greater things! - The emergence of DINO!! (ICCV 2021)
(English) 🧠 What is DINO?
Study of "Emerging Properties in Self-Supervised Vision Transformers" (ICCV, 2021)
📄 Paper Title: Emerging Properties in Self-Supervised Vision Transformers
✍️ Authors: Facebook AI Research (Caron, Mathilde, et al.)
📌 One-line Summary: A model in which ViT learns without labels by distilling between its own Teacher and Student networks!
💡 Core Idea
- DINO stands for Distillation with NO labels!
- It trains a student network to mimic the teacher network's output
without any external labels, in a self-supervised manner!
- DINO acts as an encoder that transforms images into vectors, rather than merely performing image classification.
- Teacher and Student models are trained together, but ultimately the Student model is used.
- The Teacher observes the full view of the image,
- while the Student sees cropped or transformed views and tries to extract the same information!
- The Teacher model is updated slowly as the Student improves!
📚 Background of DINO
- Limitations of Supervised Learning:
- With the rise of ViT, image classification advanced significantly.
- However, Vision models still heavily relied on large, labeled datasets like ImageNet.
- Labeling is costly and error-prone, especially in certain domains.
- Limitations of Previous Self-Supervised Models:
- Prior to DINO, most image SSL methods were CNN-based.
- CNNs mainly capture local information, while ViT can capture global context.
- Therefore!! It was perfect timing for a model that could self-supervise images using ViT!
⚙️ How DINO Works
1. Two Networks: Student and Teacher
- Both networks consist of a ViT Encoder + MLP Projection Head.
- The ViT backbone starts with randomly initialized weights!
- Note: ResNet can replace ViT as the backbone too!
[Input Image]
    ↓
Patchify (for ViT)
    ↓
ViT Backbone (or ResNet)
    ↓
[CLS] Token Output
    ↓
MLP Projection Head
    ↓
Output Vector (for contrastive-like loss)
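The Patchify step above can be sketched with plain array operations. This is a minimal illustration with assumed ViT-S/16-style shapes (224 × 224 input, 16 × 16 patches), not the paper's implementation:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * C)  # one row per patch token

image = np.zeros((224, 224, 3))                 # dummy input image
tokens = patchify(image)
print(tokens.shape)                             # (196, 768): 14 x 14 patches
```

The ViT backbone then prepends a learnable [CLS] token (197 tokens in total), and the MLP projection head maps the final [CLS] output to the vector used in the loss.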
2. View Generation: Applying various augmentations to create multiple training views from the same image
- Create 2 Global Views and
- 6 Local Views.
- Thus, a total of 8 different augmented views are generated from a single image.
- 🌍 Global Views
Item | Description |
---|---|
Quantity | 2 |
Size | 224 × 224 (same as the ViT input size) |
Usage | Both Student and Teacher use them (the Teacher uses only global views) |
Augmentations | Random resized crop, color jittering, blur, and other strong augmentations |
- 🔍 Local Views
Item | Description |
---|---|
Quantity | Typically 6 (varies depending on the experiment) |
Size | Small crops such as 96 × 96 |
Usage | Only used by the Student |
Augmentations | Random crop, color jitter, grayscale conversion, blur, etc. |
Purpose | Train the model to infer the full concept even from partial views |
- ✨ Main Augmentation Techniques
Augmentation | Description |
---|---|
Random Resized Cropping | Randomly crop and resize images to allow seeing diverse parts of the image. |
Color Jittering (brightness, contrast, saturation, hue) | Randomly change brightness, contrast, saturation, and hue to vary color characteristics. |
Random Grayscale | Randomly convert images to grayscale, helping the model rely less on color. |
Gaussian Blur | Apply blur to reduce sharpness, making the model robust to low-clarity images. |
Solarization | Invert the bright regions of an image, adding variance. |
Horizontal Flip | Flip the image horizontally. |
Normalization | Normalize pixel values to mean 0 and standard deviation 1 to stabilize training. |
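The multi-crop scheme above can be sketched as follows. For brevity, this hypothetical snippet only takes random square crops; the real pipeline also resizes each crop and applies the color/blur augmentations from the table:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, size):
    """Take one random square crop of the given size (resizing omitted)."""
    H, W, _ = image.shape
    y = int(rng.integers(0, H - size + 1))
    x = int(rng.integers(0, W - size + 1))
    return image[y:y + size, x:x + size]

image = rng.random((256, 256, 3))                           # dummy source image
global_views = [random_crop(image, 224) for _ in range(2)]  # Teacher & Student
local_views = [random_crop(image, 96) for _ in range(6)]    # Student only
views = global_views + local_views
print(len(views))                                           # 8 views per image
```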
3. Prediction Alignment: The Student learns to match the Teacher's output distribution from different views
Even though the Student and Teacher see different versions of the image,
they should produce the same output vector!
This is the core Self-Distillation approach of DINO.
- 💡 Core Idea
- In self-supervised learning, no ground-truth labels are available.
- Different augmented views of the same image are created and fed into the Student and Teacher separately.
- The Teacher's output vector is treated as a "pseudo-label".
- The Student is trained to predict as close as possible to the Teacher's output.
🧠 Process Flow
- Create two different views of the same image (View 1, View 2).
- View 1 → Input to Teacher → generates a fixed output vector.
- View 2 → Input to Student → generates a trainable output vector.
- Minimize the cross-entropy loss between the Student's and the Teacher's outputs.
- Teacher output: t = softmax(h_T(x₁) / τ_T)
- Student output: s = softmax(h_S(x₂) / τ_S)
- Loss: cross_entropy(t, s)
- τ_T, τ_S are temperature parameters: the lower the temperature, the sharper the output distribution and the stronger its influence on the weight updates.
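The prediction-alignment loss can be sketched numerically. The temperature values below (τ_T = 0.04, τ_S = 0.1) are illustrative defaults, and the centering of the Teacher output that DINO also applies to prevent collapse is omitted:

```python
import numpy as np

def softmax(z, temperature):
    z = z / temperature                 # lower temperature -> sharper distribution
    e = np.exp(z - z.max())             # subtract max for numerical stability
    return e / e.sum()

def dino_loss(teacher_logits, student_logits, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between the Teacher's (sharp) and Student's distributions."""
    t = softmax(teacher_logits, tau_t)  # pseudo-label: no gradient in practice
    s = softmax(student_logits, tau_s)
    return -np.sum(t * np.log(s + 1e-12))

logits = np.array([3.0, 1.0, 0.0])
print(dino_loss(logits, logits) < dino_loss(logits, logits[::-1]))  # True
```

Matching views give a low loss; disagreeing views give a high one, which is exactly the signal the Student trains on.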
4. Teacher Update: Not only the Student but also the Teacher learns, just slowly!
If the teacher raced ahead too quickly, the student would get lost.
So, unlike ordinary supervised training, DINO's Teacher is not trained directly;
it is slowly updated from the Student via an Exponential Moving Average (EMA) of the Student's parameters:
θ_T ← m × θ_T + (1 - m) × θ_S
- θ_T : Teacher's parameters
- θ_S : Student's parameters
- m : Momentum coefficient (typically between 0.996 and 0.999)
- While the Student learns quickly via Backpropagation,
- the Teacher has an identical architecture,
- so each of the Teacher's weights is updated following the EMA formula above!
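A minimal sketch of the momentum update θ_T ← m × θ_T + (1 - m) × θ_S, applied weight by weight in plain Python (m = 0.996 as an example):

```python
def ema_update(teacher_params, student_params, m=0.996):
    """theta_T <- m * theta_T + (1 - m) * theta_S for every weight."""
    return [m * t + (1 - m) * s for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 0.0]
student = [1.0, 1.0]
for _ in range(1000):        # many Student steps -> the Teacher drifts slowly
    teacher = ema_update(teacher, student)
print(teacher[0])            # approaches 1 - 0.996**1000, i.e. roughly 0.98
```

With m this close to 1, the Teacher moves only 0.4% of the way toward the Student per step, which is what keeps it a slowly moving average of the Student.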
🌟 Reconfirming the Importance of DINO
- No labels required: capable of learning high-quality features without manual annotation.
- Versatility: while traditional ViTs were limited to classification,
DINO extends their use to image segmentation, image clustering, image retrieval, and more!
- Potential of ViT: thanks to DINO, ViT shows much greater potential as a general-purpose encoder!
🏆 Key Achievements of DINO
- DINO demonstrated strong classification performance on ImageNet!
- But wait, DINO is an encoder! How does it classify?
- It uses the Linear Probing approach! (will be explained in detail later!)
- Additionally, DINO showed outstanding strengths in clustering, retrieval, detection, and many more tasks!
- Through Ablation Studies, the contributions of each component were carefully verified! (details explained later!)
✅ Linear Probing: Evaluating how linearly separable DINO's features are
- Freeze the ViT encoder trained by DINO.
- Add a simple Linear Classifier on top.
- Train only the Linear layer using ImageNet labels.
- Measure ImageNet Top-1 Accuracy to evaluate the quality of extracted features.
- DINO's features can almost segment major regions in images without supervision.
- DINO works effectively across various architectures (ResNet, ViT, hybrid).
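The linear-probing procedure can be sketched on synthetic data. Everything here is hypothetical: random 384-dim "features" stand in for a frozen DINO-ViT encoder's [CLS] outputs, and only the single linear layer is trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen DINO features: 100 samples, 384 dims, 2 classes.
features = rng.normal(size=(100, 384))
labels = np.repeat([0, 1], 50)
features[labels == 1, 0] += 5.0   # make the two classes linearly separable

# Linear probe: train ONLY this weight vector; the "encoder" stays frozen.
w = np.zeros(384)
b = 0.0
for _ in range(200):              # plain logistic-regression updates
    probs = 1 / (1 + np.exp(-(features @ w + b)))
    grad = probs - labels
    w -= 0.1 * (features.T @ grad) / len(labels)
    b -= 0.1 * grad.mean()

accuracy = ((features @ w + b > 0) == labels).mean()
print(accuracy)                   # high when the features separate linearly
```

High accuracy from such a cheap probe is evidence that the frozen features themselves, not the classifier, carry the class information.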
🧪 Ablation Study: Analyzing the Impact of Each Component
- In the Ablation Study, the following components were added or removed to analyze their effect on model performance:
Component | Meaning | Impact of Addition/Removal |
---|---|---|
Mom (Momentum Encoder) | Updates the Teacher's parameters via an EMA of the Student's | Without it → the Teacher doesn't learn → model collapse and a severe performance drop. A critical component. |
SK (Sinkhorn Normalization) | Normalizes the output distribution evenly | Prevents collapse only when the momentum encoder is absent; unnecessary when momentum is present. |
MC (Multi-Crop) | Creates multiple views of different scales from an image | Significantly improves feature quality; highly important. |
CE (Cross-Entropy Loss) | Aligns the Student's and Teacher's output distributions | The core loss function in DINO; removing it degrades performance. |
Pred (Predictor) | A small MLP added to the Student's output | Has minimal effect in DINO; was critical in BYOL. |
✨ Final Thoughts
In today's multimodal AI era,
transforming images into meaningful vector representations has become very natural.
DINO is a remarkable study that opened up the limitless potential of ViT,
enabling powerful learning without requiring labels!
Recently, with the rise of techniques like knowledge distillation (e.g., Deepseek),
the ability to train two models together in a self-supervised manner is gaining renewed attention.
Interestingly, this line of research traces back to BYOL (Bootstrap Your Own Latent),
a pioneering study originally proposed by DeepMind.
Next, I definitely plan to dive into studying BYOL as well! 😊