# Image? You Can Do Transformer Too!! - The Emergence of ViT!! (ICLR 2021)
Hello everyone!
Today, let's explore the Vision Transformer (ViT), a revolutionary approach in computer vision that's garnering significant attention!
## The State of AI Before ViT's Emergence
- In text analysis, word preprocessing, TF-IDF based DTM (Document-Term Matrix), BoW, and Word2Vec were widely used.
- Then, the paper "Attention is All You Need" emerged, marking the beginning of Transformer-based innovation!
- Various language models like BERT and GPT quickly appeared, leading to the rapid development of the text AI field.
But what about image analysis? It still relied on feeding pixel data into CNNs. Research based on ResNet, which appeared in 2015, was dominant, and there was a limitation in modeling global relationships across the entire image.
## The Emergence of ViT!!
What if we tokenize images like text and put them into an attention model?
- Existing text Transformers analyzed sentences by treating each word as a token.
- ViT divided the image into 16×16 patches, treating each patch as a token.
- It introduced a method of classifying images (Classification) by inputting these patches into a Transformer.
## Understanding the Detailed Structure of ViT
### 1. Image Patching
- Divides the input image into fixed-size patches.
- Example: Splitting a 224×224 image into 16×16 patches generates a total of 196 patches.
- Each of the 196 patches contains 16×16×3 = 768 values (a color image has 3 color channels: RGB)!
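As a rough illustration of this step (a minimal PyTorch sketch on a dummy image, not the authors' code), the patching can be done with plain tensor reshaping:

```python
import torch

# Dummy color image: (channels, height, width) = (3, 224, 224)
image = torch.randn(3, 224, 224)
patch_size = 16

# Cut the image into a 14x14 grid of 16x16 patches, then flatten each
# patch into a single vector of 16*16*3 = 768 values.
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([196, 768])
```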
### 2. Linear Projection
- Transforms each patch into a fixed-dimensional vector through a linear layer.
- Similar to creating word embeddings in text.
- Each of the 196 patches from the previous step is a flattened 1×768 vector (768 = 16×16×3); the linear layer projects it into the model's embedding dimension (also 768 in ViT-Base)!!
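To make this concrete, here is a minimal sketch (assuming ViT-Base, where the embedding dimension is 768): the projection is a single nn.Linear shared across all patches, much like a word-embedding layer in text models.

```python
import torch
import torch.nn as nn

patch_dim = 16 * 16 * 3   # 768 raw values per flattened patch
embed_dim = 768           # model embedding dimension (768 in ViT-Base)

# One shared linear layer maps every flattened patch into the embedding space.
patch_projection = nn.Linear(patch_dim, embed_dim)

patches = torch.randn(196, patch_dim)          # output of the patching step
patch_embeddings = patch_projection(patches)   # (196, 768)
print(patch_embeddings.shape)
```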
### 3. Positional Encoding - Different methods used for each model
Since the Transformer does not inherently recognize order, positional information is added to each patch so that the model can learn the order between patches.
#### Key Summary
| Category | Description |
|:---|:---|
| Why needed? | Because the Transformer cannot recognize order |
| How? | By adding positional information to patches |
| Method | Sine/cosine function based (fixed) / Learnable embeddings (ViT) |
| Result | The model can understand the relative and absolute positional information between patches |
#### Why is it needed?
- Transformer uses the Self-Attention mechanism, but it inherently lacks the ability to recognize the input order (patch order).
- Therefore, information is needed to tell the model what the order is between the input tokens (whether text words or image patches).
Even in images, patches are not just an unordered collection; they gain meaning through their left-right, top-bottom, and surrounding relationships. Therefore, the model needs to know "where is this patch located?"
#### How is it done?
- A Positional Encoding vector is added to or combined with each patch vector to include it in the input.
- There are two main methods:
- Sine/cosine based static (Positional Encoding)
- Generates patterns based on certain mathematical functions (sine, cosine).
- Used in the Transformer paper "Attention Is All You Need".
- Learnable positional embeddings
- Generates a vector for each position, and these vectors are learned together during the training process.
- ViT primarily used learnable positional embeddings.
Thanks to this, the model can learn information about "where this patch is located," and Attention not only looks at values but also considers "spatial context"!
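Below is a minimal sketch of the learnable variant used by ViT (assuming a sequence of 196 patches plus one [CLS] slot, i.e., 197 positions): the positional vectors are just an nn.Parameter added element-wise to the token embeddings and trained with the rest of the model.

```python
import torch
import torch.nn as nn

embed_dim, num_tokens = 768, 197   # 196 patches + 1 [CLS] slot

# One learnable vector per position, trained jointly with the model.
pos_embedding = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

tokens = torch.randn(1, num_tokens, embed_dim)  # [CLS] + patch embeddings
tokens_with_position = tokens + pos_embedding   # broadcast add, same shape
print(tokens_with_position.shape)               # torch.Size([1, 197, 768])
```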
### 4. Class Token: Adding one-line summary information in front of the patches!!!
A [CLS] token, representing the entire image, is added! So, if you split the image into 16×16 patches, a [CLS] token is placed in front of the 196 patches, making 197 tokens in total! The CLS token is also a 768-dimensional vector, just like the other patch embeddings, right!? This CLS token will ultimately represent the classification result of the image!! (A small code sketch of prepending the [CLS] token follows the key summary below.)
#### What is it?
- The [CLS] token (Classification Token) is a special token added to the beginning of the Transformer input sequence.
- This [CLS] token is initialized with a learnable weight vector from the beginning.
- Purpose: To contain summary information about the entire input image (or text), used to obtain the final classification result.
#### How does it work?
- After dividing the input image patches, each patch is embedded.
- The [CLS] embedding vector is added to the front of this embedding sequence.
- The entire sequence ([CLS] + patches) is input into the Transformer encoder.
- After passing through the Transformer encoder,
- The [CLS] token interacts with all patches centered on itself (Self-Attention) and gathers information.
- Finally, only the [CLS] token from the Transformer output is taken and put into the Classification Head (MLP Layer) to produce the final prediction value.
To put it simply: "[CLS] token = a summary reporter for this image." It's a structure where the model gathers the characteristics of the patches and summarizes them into the single [CLS] token.
#### Why is it needed?
- Although Transformer processes the entire sequence, the output also comes out separately for each token.
- Since we need to know "what class does this entire input (image) belong to?", a single vector representing all the information is needed.
- The [CLS] token serves exactly that purpose.
#### Characteristics of the [CLS] token in ViT
- The [CLS] token starts with a learnable initial random value.
- During training, it evolves into a vector that increasingly "summarizes the entire image" by giving and receiving Attention with other patches.
- This final vector can be used for image classification, feature extraction, and downstream tasks.
#### Key Summary
| Category | Description |
|:---|:---|
| Role | Stores summary information representing the entire input (image) |
| Method | Add the [CLS] token before the input patches, then aggregate information through Attention |
| Result Usage | Final classification result (connected to the Classification Head) |
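As a small sketch of how the [CLS] token might be prepended in practice (a learnable parameter expanded over the batch; an illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

embed_dim = 768
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable [CLS] vector

batch_size = 4
patch_embeddings = torch.randn(batch_size, 196, embed_dim)

# Prepend one copy of the [CLS] token to every sequence in the batch.
cls_tokens = cls_token.expand(batch_size, -1, -1)             # (4, 1, 768)
sequence = torch.cat([cls_tokens, patch_embeddings], dim=1)   # (4, 197, 768)
print(sequence.shape)
```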
### 5. Transformer Encoder
ViT's Transformer Encoder is the core module that naturally endows the model with the "ability to see the entire image"!!
#### Transformer Encoder Block Components
A single Transformer Encoder consists largely of two modules:
- Multi-Head Self-Attention (MSA)
- Multi-Layer Perceptron (MLP)
Layer Normalization and Residual Connections are always applied around these two modules.
#### Detailed Structure Flow
- LayerNorm: performs normalization on the input sequence (patches + [CLS]) first.
- Multi-Head Self-Attention (MSA)
- Allows each token to interact with every other token.
- Multiple Attention Heads operate in parallel to capture various relationships.
- Learns the global context between all image patches through Self-Attention.
- Residual Connection
- Adds the Attention result to the input.
- Makes learning more stable and alleviates the Gradient Vanishing problem.
- LayerNorm
- Performs normalization again.
- Multi-Layer Perceptron (MLP)
- Contains two Linear Layers and an activation function (GELU) in between.
- Transforms the feature representation of each token (patch) more complexly.
- Residual Connection
- Adds the MLP result to the input.
#### Overall Block Flow
```
Input (Patch Sequence + [CLS])
  ↓ LayerNorm
  ↓ Multi-Head Self-Attention
  ↓ Residual Connection (Input + Attention Output)
  ↓ LayerNorm
  ↓ MLP (2 Linear + GELU)
  ↓ Residual Connection (Input + MLP Output)
Output (Passed to the next block or the final result)
```
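To make the block flow concrete, here is a minimal pre-norm encoder block sketch in PyTorch (12 heads and a 4x MLP ratio follow the ViT-Base configuration; the module itself is only an illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """LayerNorm -> MSA -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Multi-Head Self-Attention with a residual connection
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out
        # MLP (2 Linear + GELU) with a residual connection
        x = x + self.mlp(self.norm2(x))
        return x

block = EncoderBlock()
tokens = torch.randn(4, 197, 768)   # [CLS] + 196 patch embeddings
print(block(tokens).shape)          # torch.Size([4, 197, 768])
```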
---
#### Key Role of the Transformer Encoder
| Component | Role |
|:---|:---|
| MSA (Self-Attention) | Learns the relationships between patches (captures global context) |
| MLP | Transforms the characteristics of each token non-linearly |
| LayerNorm | Improves learning stability |
| Residual Connection | Preserves information flow and stabilizes learning |
---
#### Significance of the Transformer Encoder in ViT
- Unlike CNNs that process primarily based on local patterns,
- The Transformer Encoder **models global patch-to-patch relationships all at once**.
- In particular, the [CLS] token learns to summarize the entire image during this process.
---
### 6. Classification Head
- Predicts the image class by passing the final [CLS] token through an **MLP classifier**.
- This is similar to the classification stage of CNNs such as ResNet, right!? The 768-dimensional CLS vector is passed through the MLP to produce the classification!!
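As a rough sketch (the paper uses an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning; below is the simple linear variant, with 1000 classes assumed for illustration):

```python
import torch
import torch.nn as nn

num_classes = 1000                  # assumed, e.g. an ImageNet-1k setup
head = nn.Sequential(
    nn.LayerNorm(768),              # normalize the final [CLS] vector
    nn.Linear(768, num_classes),    # map to class logits
)

cls_vector = torch.randn(4, 768)    # final [CLS] token for a batch of 4 images
logits = head(cls_vector)
print(logits.shape)                 # torch.Size([4, 1000])
```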
---
## Exploring the Inference Process of ViT!
> Let's summarize how a Vision Transformer trained through the process above performs classification when a new image is input!
### 1. Divide the image into patches
- The input image is divided into small pieces of **16×16 pixels**.
- For example, a 224×224 image is divided into a total of **196 patches**.
### 2. Combine CLS token with patch embeddings
- The **[CLS] embedding vector**, prepared during the training process, is added to the very beginning.
- Therefore, the input sequence consists of a total of **197 vectors**.
(1 [CLS] + 196 patches)
### 3. Input to Transformer Encoder
- These 197 vectors are passed through the **Transformer Encoder**.
- The Transformer models the relationships between each patch and the [CLS] through Self-Attention.
- The [CLS] comes to summarize the information of all patches.
### 4. Extract CLS token
- Only the **[CLS] token at the very beginning** of the Transformer output is extracted separately.
- This vector is a representation that aggregates the overall features of the input image.
### 5. Perform classification through MLP Head
- The final [CLS] vector is input into the **MLP Head** (Multi-Layer Perceptron).
- The MLP Head receives this vector and predicts **Class probabilities**.
- (e.g., one of: cat, dog, car)
### Final ViT Inference Flow Summary
```
Image Input
  ↓
Divide into Patches
  ↓
Combine [CLS] + Patches
  ↓
Pass through Transformer Encoder
  ↓
Extract [CLS] Vector
  ↓
Classify with MLP Head
  ↓
Output Final Prediction Result!
```
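Putting the whole flow together, here is a compact end-to-end sketch of a toy ViT forward pass (built from standard PyTorch modules under the assumptions used throughout this post: 16×16 patches, 768-dimensional embeddings, learnable [CLS]/position parameters, and a hypothetical 1000-class head; an illustration, not the official implementation):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Toy ViT to illustrate the inference flow; not the reference implementation."""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2                # 196
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * 3, dim)      # linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # [CLS] token
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)                      # classification head

    def forward(self, images):                      # images: (B, 3, 224, 224)
        b, c, _, _ = images.shape
        p = self.patch_size
        # 1. Divide the image into 196 flattened patches of 768 values each
        patches = images.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        # 2. Linear projection, prepend [CLS], add positional embeddings
        tokens = self.proj(patches)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        tokens = tokens + self.pos_embedding
        # 3. Transformer Encoder  4. extract [CLS]  5. MLP Head
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])             # class logits

model = TinyViT()
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)                                 # torch.Size([1, 1000])
```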
## The World Changed After ViT's Emergence!!
### Importance of Large-Scale Data
- ViT has weak inductive biases, so it can be weaker than CNNs on small datasets, but
- It showed performance surpassing CNNs when pre-trained on hundreds of millions or billions of images.
### Computational Efficiency
- For very large models, the computational cost required for pre-training can be lower than that of CNNs.
### Global Relationship Modeling
- Naturally models long-range dependencies within the image through Self-Attention.
- Since Self-Attention doesn't care about the distance between input tokens (= image patches) (distance is an important factor in CNNs!!),
- All patches are directly connected to all other patches!!!
- Patch #1 can directly ask Patch #196, "Is this important?"
- It doesnโt matter if they are 1, 10, or 100 pixels apart.
- In other words, distant patches can directly influence each other!
### Interpretability
- By visualizing the Attention Map (like CAM!!), it becomes possible to intuitively interpret which parts of the image the model is focusing on.
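As a rough sketch of the idea (using a bare nn.MultiheadAttention layer rather than a specific ViT library): the attention weights from the [CLS] query to the 196 patch keys can be reshaped into a 14×14 grid and overlaid on the image like a CAM-style heatmap.

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

tokens = torch.randn(1, 197, dim)   # [CLS] + 196 patch embeddings
# need_weights=True returns attention weights averaged over heads: (B, 197, 197)
_, weights = attn(tokens, tokens, tokens, need_weights=True)

# Row 0 holds the [CLS] query's attention; drop its own entry to keep the patches.
cls_to_patches = weights[0, 0, 1:]                # (196,)
attention_map = cls_to_patches.reshape(14, 14)    # 14x14 spatial grid over the image
print(attention_map.shape)
```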
## ViT's Impact
- After ViT's success, research on Transformer architectures exploded in the field of computer vision.
- Subsequently, various Vision Transformer family models (DeiT, Swin Transformer, etc.) emerged, showing excellent performance in various fields such as:
- Image Recognition (Image Classification)
- Object Detection
- Image Segmentation
- Various ViT-based models such as DINO and CLIP have emerged!!
## Note
Vision Transformer is not just another model; it was a true paradigm shift that "ushered in the Transformer era in the field of vision."