📝 ViT, you can do greater things! - The emergence of DINO!! // ViT, 너는 더 큰일을 할수있어! - DINO의 등장!! (ICCV 2021)

Posted Apr 30, 2025

By DrFirst

20 min read

(English) 🧠 What is DINO?

Study of “Emerging Properties in Self-Supervised Vision Transformers” (ICCV, 2021)

📖 Paper Title: Emerging Properties in Self-Supervised Vision Transformers
✍️ Authors: Facebook AI Research (Caron, Mathilde, et al.)
🌟 One-line Summary: A core model where ViT learns by creating its own teacher and student models without labels!

📚 Core Idea

DINO stands for Distillation with No Labels!
It enables a student network to mimic the teacher network’s output
without any external labels, in a self-supervised learning manner!
DINO acts more as an encoder that transforms images into vectors, rather than merely performing image classification.
Teacher and Student models are trained together, but ultimately the Student model is used.
The Teacher observes the full view of the image,
The Student sees cropped or transformed views and tries to learn the same information!
The Teacher model slowly updates together with the Student’s improvements!

🔍 Background of DINO

Limitations of Supervised Learning:
- With the rise of ViT, image classification advanced significantly.
- However, Vision models still heavily relied on large, labeled datasets like ImageNet.
- Labeling is costly and error-prone, especially in certain domains.
Limitations of Previous Self-Supervised Models:
- Prior to DINO, most image SSL methods were CNN-based.
- CNNs mainly capture local information, while ViT can capture global context.
Therefore!! It was perfect timing for a model that could self-supervise images using ViT!

🔍 How DINO Works

1. Two Networks: Student and Teacher

Both networks consist of a ViT Encoder + MLP Projection Head.
The ViT backbone starts with randomly initialized weights!
Note: ResNet can replace ViT as the backbone too!

[Input Image]
    ↓
Patchify (for ViT)
    ↓
ViT Backbone (or ResNet)
    ↓
[CLS] Token Output
    ↓
MLP Projection Head
    ↓
Output Vector (for contrastive-like loss)

2. View Generation: Applying various augmentations to create multiple training views from the same image

Create 2 Global Views and
6 Local Views
Thus, a total of 8 different augmented views are generated from a single image.
🌍 Global Views

Item	Description
Quantity	2
Size	224 × 224 (same as ViT input size)
Usage	Both Student and Teacher use them (Teacher uses only global views)
Augmentations	Random resized crop, color jittering, blur, and other strong augmentations

🔎 Local Views

Item	Description
Quantity	Typically 6 (varies depending on the experiment)
Size	Small crops like 96 × 96
Usage	Only used by the Student
Augmentations	Random crop, color jitter, grayscale conversion, blur, etc.
Purpose	Train the model to infer the full concept even from partial views

✨ Main Augmentation Techniques

Augmentation	Description
Random Resized Cropping	Randomly crop and resize images to allow seeing diverse parts of the image.
Color Jittering (brightness, contrast, saturation, hue)	Randomly change brightness, contrast, saturation, and hue to vary color characteristics.
Random Grayscale	Randomly convert images to grayscale, helping the model rely less on color.
Gaussian Blur	Apply blur to reduce sharpness, making the model robust to low-clarity images.
Solarization	Invert the bright regions of an image, adding variance.
Horizontal Flip	Flip the image horizontally.
Normalization	Normalize pixel values to mean 0 and standard deviation 1 to stabilize training.

3. Prediction Alignment: The Student learns to match the Teacher’s output distribution from different views

Even though the Student and Teacher see different versions of the image,
they should produce the same output vector!
This is the core Self-Distillation approach of DINO.

🔍 Core Idea
- In self-supervised learning, no ground truth labels are available.
- Different augmented views of the same image are created and fed into the Student and Teacher separately.
- The Teacher’s output vector is treated as a “pseudo-label”.
- The Student is trained to predict as close as possible to the Teacher’s output.
🧠 Process Flow
1. Create two different views of the same image.
2. View 1 → Input to Teacher → Generate fixed output vector.
3. View 2 → Input to Student → Generate trainable output vector.
4. Minimize the Cross-Entropy loss between the Student’s and Teacher’s outputs.
```
Teacher output: t = softmax(h_T(x₁) / τ_T)
Student output: s = softmax(h_S(x₂) / τ_S)
Loss = cross_entropy(t, s)
```
θ_T : Teacher’s parameters
θ_S : Student’s parameters
m : Momentum coefficient (typically between 0.996 and 0.999)
While the Student learns quickly via Backpropagation,
the Teacher and Student have identical architectures,
thus each of the Teacher’s weights is updated following the formula above!

🚀 Reconfirming the Importance of DINO

No labels required: Capable of learning high-quality features without manual annotation.
Versatility: While traditional ViTs were limited to classification,
DINO expands the use to image segmentation, image clustering, image retrieval, and more!
Potential of ViT: Thanks to DINO, ViT shows much greater potential as a general-purpose encoder!

📈 Key Achievements of DINO

DINO demonstrated strong classification performance on ImageNet!
- But wait — DINO is an encoder! How does it classify?
- It uses the Linear Probing approach! (will be explained in detail later!)
Additionally, DINO showed outstanding strengths in clustering, retrieval, detection, and many more tasks!
Through Ablation Studies, the contributions of each component were carefully verified! (details explained later!)

✅ Linear Probing: Evaluating how linearly separable DINO’s features are

Freeze the ViT encoder trained by DINO.
Add a simple Linear Classifier on top.
Train only the Linear layer using ImageNet labels.
Measure ImageNet Top-1 Accuracy to evaluate the quality of extracted features.

DINO’s features can almost segment major regions in images without supervision.
DINO works effectively across various architectures (ResNet, ViT, hybrid).

🧪 Ablation Study: Analyzing the Impact of Each Component

In the Ablation Study, the following components were added or removed to analyze their effect on model performance:

Component	Meaning	Impact of Addition/Removal
Mom (Momentum Encoder)	Updates the Teacher’s parameters via EMA based on Student	Without it → Teacher doesn’t learn → Model collapse, severe performance drop. A critical component.
SK (Sinkhorn Normalization)	Normalizes the output distribution evenly	Helps prevent collapse only if no momentum. Not necessary if momentum is present.
MC (Multi-Crop)	Creates multiple views of different scales from an image	Significantly improves feature quality, highly important.
CE (Cross-Entropy Loss)	Aligns the Student’s and Teacher’s output distributions	The core loss function in DINO, removing it degrades performance.
Pred (Predictor)	A small MLP added to the Student’s output	Has minimal effect in DINO. Was critical in BYOL.

✨ Final Thoughts

In today’s multimodal AI era,
transforming images into meaningful vector representations has become very natural.

✅ DINO is a remarkable study that opened up the limitless potential of ViT,
✅ enabling powerful learning without requiring labels!

Recently, with the rise of techniques like knowledge distillation (e.g., Deepseek),
the ability to train two models together in a self-supervised manner is gaining renewed attention.

Interestingly, this line of research traces back to BYOL (Bootstrap Your Own Latent),
a pioneering study originally proposed by DeepMind.

Next, I definitely plan to dive into studying BYOL as well! 😊

(한국어) 🧠 ViT, 너는 더 큰일을 할수있어! - DINO의 등장!!

『Emerging Properties in Self-Supervised Vision Transformers』(ICCV, 2021) 공부

📖 논문 제목: Emerging Properties in Self-Supervised Vision Transformers
✍️ 저자: Facebook AI Research (Caron, Mathilde, et al)
🌟 한줄 요약: Label 없이 스스로 선생님과 학생모델을 만들어 학습한 ViT의 핵심 모델!!

📚 핵심 아이디어

DINO란 Distillation with No Labels의 약자로!,
별도의 레이블 없이!!
학생 네트워크가 교사 네트워크의 출력을 모방하도록 학습하는 Self-Supervised Learning 방법!!
ViT와 같은 Image classification이라기보단, 이미지를 벡터로 바꾸는 인코더로서 역할을한다.
교사와 학생모델로 구분되서 학습되고, 최종 모델은 학습모델이 사용된다!
교사 모델은 큰 그림을 보고 있으며,
학생 모델은 잘리거나 변환된 이미지를 보고있는데,
학생 모델이 이 변환된 이미지에서 교사 모델의 정보와 동일한 데이터를 추출하도록 학습한다!!
교사모델도 학생의 발전과 함께 천천히 학습한다!!

🔍 DINO 연구의 배경!

Supervised Learning의 한계!
- ViT의 등장으로 이미지 Classification에 많은 발전이 있었지만!,
- 여전히 Vision 모델은 ImageNet 등 대규모 라벨링된 데이터셋에 의존하고있었다.
- 이는 △ 라벨링 비용이 크고 △ 일부 도메인에서는 라벨이 아예 없거나 부정확한 문제가 있었다!!
Self-Supervised 모델들의 한계 : CNN 기반
- ViT이 등장한지 얼마 안되었기에, 이미 있는 이미지의 Self-Supervised 는 모두 CNN 기반이었다.
- CNN은 지역적 정보에에 의존하기에, Transformer의 특징인 전역 정보를 잘 활용하지 못했다.
그래서!! 이미지를!, Self-Supervised 방식으로 학습할수 있는 모델이 필요한 타이밍이었다!!!

🔍 DINO의 작동 원리

1. 두 개의 네트워크: 학생(Student)과 교사(Teacher) 네트워크.

각 네트워크는 ViT 인코더 + MLP Projection Head로 구성되었다!!
ViT Backbone이란 randon initialized된 ViT의 가중치를 사용해서 만듦!!
ViT 인코더는 CNN 인코더로도 대체가 된다는 놀라운 사실!!

  [Input Image]
      ↓
  Patchify (for ViT)
      ↓
  ViT Backbone (or ResNet)
      ↓
  [CLS] Token Output
      ↓
  MLP Projection Head
      ↓
  Output Vector (for contrastive-like loss)

2. 뷰 생성: 동일한 이미지에 다양한 증강을 적용해 여러 학습용 버전(“views”)을 만듦

총 2개의 글로벌 뷰(global views) 와
6개의 로컬 뷰(local views) 를 생성
즉, 한 이미지에서 총 8개의 서로 다른 증강 이미지를 만듦

🌍 글로벌 뷰 (Global Views)

항목	설명
수량	2개
크기	224 × 224 (ViT 입력 사이즈와 동일)
용도	Student & Teacher 모두 사용 (Teacher는 오직 글로벌 뷰만 입력받음)
증강	랜덤 resized crop, 색상 왜곡, blur 등 강한 증강 포함

🔎 로컬 뷰 (Local Views)

항목	설명
수량	보통 6개 (논문에서 다양하게 실험됨)
크기	96 × 96 등 소형 crop
용도	Student만 입력 받음
증강	랜덤 crop, 색상 왜곡, grayscale, blur 등
목적	작은 영역만 보고도 전체 컨셉을 추론하도록 학습 유도

✨ 주요 증강 기법

증강 기법	설명
Random Resized Cropping	이미지를 무작위로 자르고 크기 조정. 같은 이미지를 다양한 시점에서 보도록 하기 위해 사용.
Color Jittering (brightness, contrast, saturation, hue)	밝기, 대비, 채도, 색조 등을 무작위로 변화. 색상 의존 감소.
Random Grayscale	이미지를 흑백 전환. 색이 없어도 인식할 수 있게 훈련.
Gaussian Blur	이미지에 블러 부여. 선명하지 않은 이해할 수 있도록 훈련.
Solarization	밝은 부분을 반전. 다양한 광량 조건에서 테스트
Horizontal Flip	이미지 좌우 반전. 방향이 바뀌어도 같은 대상을 인식할 수 있도록 함
Normalization	이미지 픽셀 값을 평균 0, 표준편차 1로 정규화. 학습 안정성과 속도를 높이기 위한 기본 처리

3. 예측 정렬: 학생은 서로 다른 뷰에서 교사의 출력 분포(가짜 레이블)를 맞추도록 학습

선생 모델과 학생 모델이 다른 이미지를 봤지만 같은 분석결과(벡터)를 내놔야 해!”
이것이 DINO의 핵심인 Self-Distillation 방식!!

-🔍 개념 요약 - Self-supervised 방식에서는 정답 레이블이 없기에
- 같은 이미지의 서로 다른 뷰(augmented views)를 만들어 각각 Student와 Teacher에 입력
- Teacher의 출력 벡터를 일종의 “가짜 정답(label)”로 보고,
- Student가 이 출력을 최대한 비슷하게 예측하도록 학습!!!

🧠 과정 흐름
1. 같은 이미지를 두 가지 뷰로 만듭니다 (View 1, View 2).
2. View 1 → Teacher에 입력 → 고정된 output 벡터 생성
3. View 2 → Student에 입력 → 학습 중인 output 벡터 생성
4. Student의 출력을 Teacher의 출력과 비슷하게 정렬하도록 Loss를 계산(cross_entropy)
5. 이때 Loss를 줄이는 방향으로 Student의 모델이 학습됨
  - 학습의 과정을 자세히 보면
  - Teacher 출력: t = softmax(h_T(x₁) / τ_T)
  - Student 출력: s = softmax(h_S(x₂) / τ_S)
  - Loss: Loss = cross_entropy(t, s)
  - τ_T, τ_Stemperature: 낮을수록 출력값의 영향이 커지니 더 빠르게 가중치 변화에 영향을 미침

4. 교사 업데이트: 학생 뿐만 아니라 선생님도 학습을 한다!! 다만, 천천히!!

만약 선생님이 진도를 너무 빠르게 나가면 학생이 햇갈리겟지요!?
이에, 일반적인 지도 학습과 달리, DINO의 Teacher는 직접 학습되는 게 아니라
Student 네트워크를 기반으로 천천히 업데이트됩니다.

학습되는 방법은!? Student의 파라미터를 지수이동평균(EMA, Exponential Moving Average)
```
θ_T ← m × θ_T + (1 - m) × θ_S
```
- θ_T : Teacher의 파라미터
- θ_S : Student의 파라미터
- m : 모멘텀 계수 (보통 0.996 ~ 0.999)
Student는 역전파(Backpropagation)로 빠르게 학습되는 반면!!
Teacher와 Student는 구조가 완전히 동일하기 때문에, 하나하나의 선생님의 weight는 위의 수식대로 업데이트됩니다!!

🚀 다시한번 정리해보는 DINO의 중요성

레이블 불필요: 수작업 주석 없이 고품질 특징(feature)을 학습할 수 있다.
범용성: 기존 ViT가 Classfication에만 적용했다면 이를 넘어 image segmentation, iamge Clustering, 이미지 검색 등 다양한 작업을 가능하게 해줌!!
ViT의 발전 가능성: DINO가 있기에 ViT가 인코더로서 더 다양한 발전 가능성을 보여주게됩니다!!

📈 DINO의 주요 결과

DINO는 ImageNet에서 강력한 Classification 결과를 보여줬다!!
- 그런데! 인코더인데 어떻게 classification!?
- Linear Probing (선형 분류기 평가) 방식을 활용!! - 뒤에서 자세히 소개!!
뿐만아니라, 클러스터링, 검색, Detection 등 다양한 부분에서 많은 강점을 보여주었다!!
Ablation Study에서 각 기능별로의 성능을 확인해봄!! - 뒤에서 자세히 소개!!

✅ Linear Probing : DINO가 뽑은 feature가 얼마나 “잘 구분되게(linearly separable)” 구성되어 있는지를 측정하는 방식

DINO로 학습된 ViT 인코더를 동결(freeze)
그 위에 간단한 선형 분류기 (Linear Classifier)를 하나 추가
이 Linear Layer만 ImageNet 라벨을 사용해 학습
이 구조로 ImageNet Top-1 Accuracy 등을 측정하여 인코더의 표현력이 얼마나 좋은지 평가
- DINO 표현은 이미지 내 주요 영역을 거의 지도 없이 세분화할 수 있는 능력 발휘
- 다양한 아키텍처(ResNet, ViT, 하이브리드)에서도 작동한다.

🧪 Ablation Study: 구성 요소별 성능 영향 분석

DINO 모델의 아래 구성 요소들를 추가하거나 제거하면서 모델 성능에 어떤 변화가 있는지 실험적으로 확인

구성 요소	의미	유무에 따른 결과 요약
Mom (Momentum Encoder)	Student의 가중치로 Teacher를 EMA 방식으로 업데이트	없으면 선생님이 공부를 안하는거! 모델 collapse, 성능 급감. 핵심 구성 요소
SK (Sinkhorn Normalization)	분포를 균등하게 정규화하는 방식	모멘텀이 없을 때만 collapse 방지 효과. 모멘텀이 있으면 불필요
MC (Multi-Crop)	한 이미지를 다양한 크기로 잘라 여러 뷰 생성	Representation 품질을 크게 향상시킴. 중요도 높음
CE (Cross-Entropy Loss)	Student와 Teacher의 분포 정렬 손실 함수	DINO의 핵심 학습 손실 함수, 없으면 성능 저하
Pred (Predictor)	Student에 추가된 작은 MLP 예측기	DINO에서는 영향 거의 없음. BYOL에선 필수였던 요소

✨ 마무리하며

멀티모달의 시대인 지금!! 이미지를 벡터로 바꾸는것은 너무나 자연스러운데요~

DINO는 별도의 라벨 없이 자체 학습으로!! ViT의 무한한 발전 가능성을 열어준 연구인것 같습니다!!

최근 Deepseek 등으로 knowledge distillation이 주목받고있는데!!

스스로 2개의 모델을 학습시키며 발전시켰다는것이 인상깊습니다!!

더 찾아보니 이 연구는 BYOL(Bootstrap Your Own Latent) 이라는, Deepmind의 연구에서 처음 제안되었다고합니다!!

다음번엔 BYOL을 공부해 봐야겠네요~!^^

AI, Research

This post is licensed under CC BY 4.0 by the author.