# Image? You Can Do Transformer Too!! - The Emergence of ViT!! (ICLR 2021)
Hello everyone!
Today, let's explore the Vision Transformer (ViT), a revolutionary approach in computer vision that's garnering significant attention!
## The State of AI Before ViT's Emergence
- In text analysis, word preprocessing, TF-IDF based DTM (Document-Term Matrix), BoW, and Word2Vec were widely used.
- Then, the paper "Attention is All You Need" emerged, marking the beginning of Transformer-based innovation!
- Various language models like BERT and GPT quickly appeared, leading to the rapid development of the text AI field.
But what about image analysis? It still relied on feeding pixel data into CNNs. Research based on ResNet, which appeared in 2015, was dominant, and there was a limitation in modeling global relationships across the entire image.
## The Emergence of ViT!!
What if we tokenize images like text and put them into an attention model?
- Existing text Transformers analyzed sentences by treating each word as a token.
- ViT divided the image into 16×16 patches, treating each patch as a token.
- It introduced a method of classifying images (Classification) by inputting these patches into a Transformer.
## Understanding the Detailed Structure of ViT
### 1. Image Patching
- Divides the input image into fixed-size patches.
- Example: Splitting a 224×224 image into 16×16 patches generates a total of 196 patches.
- Each of the 196 patches contains 16×16×3 = 768 values (a color image has 3 color channels: RGB)!
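As a rough illustration of this step (a minimal PyTorch sketch on a dummy image, not the authors' code), the patching can be done with plain tensor reshaping:

```python
import torch

# Dummy color image: (channels, height, width) = (3, 224, 224)
image = torch.randn(3, 224, 224)
patch_size = 16

# Cut the image into a 14x14 grid of 16x16 patches, then flatten each
# patch into a single vector of 16*16*3 = 768 values.
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([196, 768])
```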
### 2. Linear Projection
- Transforms each patch into a fixed-dimensional vector through a linear layer.
- Similar to creating word embeddings in text.
- Each of the 196 patches from the previous step is a flattened 1×768 vector (768 = 16×16×3); the linear layer projects it into the model's embedding dimension (also 768 in ViT-Base)!!
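To make this concrete, here is a minimal sketch (assuming ViT-Base, where the embedding dimension is 768): the projection is a single nn.Linear shared across all patches, much like a word-embedding layer in text models.

```python
import torch
import torch.nn as nn

patch_dim = 16 * 16 * 3   # 768 raw values per flattened patch
embed_dim = 768           # model embedding dimension (768 in ViT-Base)

# One shared linear layer maps every flattened patch into the embedding space.
patch_projection = nn.Linear(patch_dim, embed_dim)

patches = torch.randn(196, patch_dim)          # output of the patching step
patch_embeddings = patch_projection(patches)   # (196, 768)
print(patch_embeddings.shape)
```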
### 3. Positional Encoding - Different methods used for each model
Since the Transformer does not inherently recognize order, positional information is added to each patch so that the model can learn the order between patches.
#### Key Summary
| Category | Description |
|:---|:---|
| Why needed? | Because the Transformer cannot recognize order |
| How? | By adding positional information to patches |
| Method | Sine/cosine function based (fixed) / Learnable embeddings (ViT) |
| Result | The model can understand the relative and absolute positional information between patches |
#### Why is it needed?
- Transformer uses the Self-Attention mechanism, but it inherently lacks the ability to recognize the input order (patch order).
- Therefore, information is needed to tell the model what the order is between the input tokens (whether text words or image patches).
Even in images, patches are not just an unordered collection; they gain meaning through their left-right, top-bottom, and surrounding relationships. Therefore, the model needs to know "where is this patch located?"
#### How is it done?
- A Positional Encoding vector is added to or combined with each patch vector to include it in the input.
- There are two main methods:
- Sine/cosine based static (Positional Encoding)
- Generates patterns based on certain mathematical functions (sine, cosine).
- Used in the Transformer paper "Attention Is All You Need".
- Learnable positional embeddings
- Generates a vector for each position, and these vectors are learned together during the training process.
- ViT primarily used learnable positional embeddings.
Thanks to this, the model can learn information about "where this patch is located," and Attention not only looks at values but also considers "spatial context"!
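Below is a minimal sketch of the learnable variant used by ViT (assuming a sequence of 196 patches plus one [CLS] slot, i.e., 197 positions): the positional vectors are just an nn.Parameter added element-wise to the token embeddings and trained with the rest of the model.

```python
import torch
import torch.nn as nn

embed_dim, num_tokens = 768, 197   # 196 patches + 1 [CLS] slot

# One learnable vector per position, trained jointly with the model.
pos_embedding = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

tokens = torch.randn(1, num_tokens, embed_dim)  # [CLS] + patch embeddings
tokens_with_position = tokens + pos_embedding   # broadcast add, same shape
print(tokens_with_position.shape)               # torch.Size([1, 197, 768])
```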
### 4. Class Token: Adding one-line summary information in front of the patches!!!
A [CLS] token, representing the entire image, is added! So, if you split the image into 16×16 patches, a [CLS] token is placed in front of the 196 patches, making 197 tokens in total! The CLS token is also a 768-dimensional vector, just like the other patch embeddings, right!? This CLS token will ultimately represent the classification result of the image!! (A small code sketch of prepending the [CLS] token follows the key summary below.)
#### What is it?
- The [CLS] token (Classification Token) is a special token added to the beginning of the Transformer input sequence.
- This [CLS] token is initialized with a learnable weight vector from the beginning.
- Purpose: To contain summary information about the entire input image (or text), used to obtain the final classification result.
#### How does it work?
- After dividing the input image patches, each patch is embedded.
- The [CLS] embedding vector is added to the front of this embedding sequence.
- The entire sequence ([CLS] + patches) is input into the Transformer encoder.
- After passing through the Transformer encoder,
- The [CLS] token interacts with all patches centered on itself (Self-Attention) and gathers information.
- Finally, only the [CLS] token from the Transformer output is taken and put into the Classification Head (MLP Layer) to produce the final prediction value.
To put it simply: "[CLS] token = a summary reporter for this image." It's a structure where the model gathers the characteristics of the patches and summarizes them into the single [CLS] token.
#### Why is it needed?
- Although Transformer processes the entire sequence, the output also comes out separately for each token.
- Since we need to know "what class does this entire input (image) belong to?", a single vector representing all the information is needed.
- The [CLS] token serves exactly that purpose.
#### Characteristics of the [CLS] token in ViT
- The [CLS] token starts with a learnable initial random value.
- During training, it evolves into a vector that increasingly "summarizes the entire image" by giving and receiving Attention with other patches.
- This final vector can be used for image classification, feature extraction, and downstream tasks.
#### Key Summary
| Category | Description |
|:---|:---|
| Role | Stores summary information representing the entire input (image) |
| Method | Add the [CLS] token before the input patches, then aggregate information through Attention |
| Result Usage | Final classification result (connected to the Classification Head) |
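As a small sketch of how the [CLS] token might be prepended in practice (a learnable parameter expanded over the batch; an illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

embed_dim = 768
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable [CLS] vector

batch_size = 4
patch_embeddings = torch.randn(batch_size, 196, embed_dim)

# Prepend one copy of the [CLS] token to every sequence in the batch.
cls_tokens = cls_token.expand(batch_size, -1, -1)             # (4, 1, 768)
sequence = torch.cat([cls_tokens, patch_embeddings], dim=1)   # (4, 197, 768)
print(sequence.shape)
```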
### 5. Transformer Encoder
ViT's Transformer Encoder is the core module that naturally endows the model with the "ability to see the entire image"!!
#### Transformer Encoder Block Components
A single Transformer Encoder consists largely of two modules:
- Multi-Head Self-Attention (MSA)
- Multi-Layer Perceptron (MLP)
Layer Normalization and Residual Connections are always applied around these two modules.
#### Detailed Structure Flow
- LayerNorm: performs normalization on the input sequence (patches + [CLS]) first.
- Multi-Head Self-Attention (MSA)
- Allows each token to interact with every other token.
- Multiple Attention Heads operate in parallel to capture various relationships.
- Learns the global context between all image patches through Self-Attention.
- Residual Connection
- Adds the Attention result to the input.
- Makes learning more stable and alleviates the Gradient Vanishing problem.
- LayerNorm
- Performs normalization again.
- Multi-Layer Perceptron (MLP)
- Contains two Linear Layers and an activation function (GELU) in between.
- Transforms the feature representation of each token (patch) more complexly.
- Residual Connection
- Adds the MLP result to the input.
#### Overall Block Flow
```
Input (Patch Sequence + [CLS])
  ↓ LayerNorm
  ↓ Multi-Head Self-Attention
  ↓ Residual Connection (Input + Attention Output)
  ↓ LayerNorm
  ↓ MLP (2 Linear + GELU)
  ↓ Residual Connection (Input + MLP Output)
Output (Passed to the next block or the final result)
```
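To make the block flow concrete, here is a minimal pre-norm encoder block sketch in PyTorch (12 heads and a 4x MLP ratio follow the ViT-Base configuration; the module itself is only an illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """LayerNorm -> MSA -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Multi-Head Self-Attention with a residual connection
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out
        # MLP (2 Linear + GELU) with a residual connection
        x = x + self.mlp(self.norm2(x))
        return x

block = EncoderBlock()
tokens = torch.randn(4, 197, 768)   # [CLS] + 196 patch embeddings
print(block(tokens).shape)          # torch.Size([4, 197, 768])
```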
---
#### Key Role of the Transformer Encoder
| Component | Role |
|:---|:---|
| MSA (Self-Attention) | Learns the relationships between patches (captures global context) |
| MLP | Transforms the characteristics of each token non-linearly |
| LayerNorm | Improves learning stability |
| Residual Connection | Preserves information flow and stabilizes learning |
---
#### Significance of the Transformer Encoder in ViT
- Unlike CNNs that process primarily based on local patterns,
- The Transformer Encoder **models global patch-to-patch relationships all at once**.
- In particular, the [CLS] token learns to summarize the entire image during this process.
---
### 6. Classification Head
- Predicts the image class by passing the final [CLS] token through an **MLP classifier**.
- This is similar to the classification stage of CNNs such as ResNet, right!? The 768-dimensional CLS vector is passed through the MLP to produce the classification!!
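As a rough sketch (the paper uses an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning; below is the simple linear variant, with 1000 classes assumed for illustration):

```python
import torch
import torch.nn as nn

num_classes = 1000                  # assumed, e.g. an ImageNet-1k setup
head = nn.Sequential(
    nn.LayerNorm(768),              # normalize the final [CLS] vector
    nn.Linear(768, num_classes),    # map to class logits
)

cls_vector = torch.randn(4, 768)    # final [CLS] token for a batch of 4 images
logits = head(cls_vector)
print(logits.shape)                 # torch.Size([4, 1000])
```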
---
## Exploring the Inference Process of ViT!
> Let's summarize how a Vision Transformer trained through the process above performs classification when a new image is input!
### 1. Divide the image into patches
- The input image is divided into small pieces of **16×16 pixels**.
- For example, a 224×224 image is divided into a total of **196 patches**.
### 2. Combine CLS token with patch embeddings
- The **[CLS] embedding vector**, prepared during the training process, is added to the very beginning.
- Therefore, the input sequence consists of a total of **197 vectors**.
(1 [CLS] + 196 patches)
### 3. Input to Transformer Encoder
- These 197 vectors are passed through the **Transformer Encoder**.
- The Transformer models the relationships between each patch and the [CLS] through Self-Attention.
- The [CLS] comes to summarize the information of all patches.
### 4. Extract CLS token
- Only the **[CLS] token at the very beginning** of the Transformer output is extracted separately.
- This vector is a representation that aggregates the overall features of the input image.
### 5. Perform classification through MLP Head
- The final [CLS] vector is input into the **MLP Head** (Multi-Layer Perceptron).
- The MLP Head receives this vector and predicts **Class probabilities**.
- (e.g., one of: cat, dog, car)
### Final ViT Inference Flow Summary
```
Image Input
  ↓
Divide into Patches
  ↓
Combine [CLS] + Patches
  ↓
Pass through Transformer Encoder
  ↓
Extract [CLS] Vector
  ↓
Classify with MLP Head
  ↓
Output Final Prediction Result!
```
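Putting the whole flow together, here is a compact end-to-end sketch of a toy ViT forward pass (built from standard PyTorch modules under the assumptions used throughout this post: 16×16 patches, 768-dimensional embeddings, learnable [CLS]/position parameters, and a hypothetical 1000-class head; an illustration, not the official implementation):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Toy ViT to illustrate the inference flow; not the reference implementation."""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2                # 196
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * 3, dim)      # linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # [CLS] token
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)                      # classification head

    def forward(self, images):                      # images: (B, 3, 224, 224)
        b, c, _, _ = images.shape
        p = self.patch_size
        # 1. Divide the image into 196 flattened patches of 768 values each
        patches = images.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        # 2. Linear projection, prepend [CLS], add positional embeddings
        tokens = self.proj(patches)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        tokens = tokens + self.pos_embedding
        # 3. Transformer Encoder  4. extract [CLS]  5. MLP Head
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])             # class logits

model = TinyViT()
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)                                 # torch.Size([1, 1000])
```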
## The World Changed After ViT's Emergence!!
### Importance of Large-Scale Data
- ViT has weak inductive biases, so it can be weaker than CNNs on small datasets, but
- It showed performance surpassing CNNs when pre-trained on hundreds of millions or billions of images.
### Computational Efficiency
- For very large models, the computational cost required for pre-training can be lower than that of CNNs.
### Global Relationship Modeling
- Naturally models long-range dependencies within the image through Self-Attention.
- Since Self-Attention doesn't care about the distance between input tokens (= image patches) (distance is an important factor in CNNs!!),
- All patches are directly connected to all other patches!!!
- Patch #1 can directly ask Patch #196, "Is this important?"
- It doesnโt matter if they are 1, 10, or 100 pixels apart.
- In other words, distant patches can directly influence each other!
### Interpretability
- By visualizing the Attention Map (like CAM!!), it becomes possible to intuitively interpret which parts of the image the model is focusing on.
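As a rough sketch of the idea (using a bare nn.MultiheadAttention layer rather than a specific ViT library): the attention weights from the [CLS] query to the 196 patch keys can be reshaped into a 14×14 grid and overlaid on the image like a CAM-style heatmap.

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

tokens = torch.randn(1, 197, dim)   # [CLS] + 196 patch embeddings
# need_weights=True returns attention weights averaged over heads: (B, 197, 197)
_, weights = attn(tokens, tokens, tokens, need_weights=True)

# Row 0 holds the [CLS] query's attention; drop its own entry to keep the patches.
cls_to_patches = weights[0, 0, 1:]                # (196,)
attention_map = cls_to_patches.reshape(14, 14)    # 14x14 spatial grid over the image
print(attention_map.shape)
```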
## ViT's Impact
- After ViT's success, research on Transformer architectures exploded in the field of computer vision.
- Subsequently, various Vision Transformer family models (DeiT, Swin Transformer, etc.) emerged, showing excellent performance in various fields such as:
- Image Recognition (Image Classification)
- Object Detection
- Image Segmentation
- Various ViT-based models such as DINO and CLIP have emerged!!
## Note
Vision Transformer is not just another model; it was a true paradigm shift that "ushered in the Transformer era in the field of vision."