
Image? You Can Do Transformer Too!! - The Emergence of ViT!! (ICLR 2021)

Hello everyone! 👋

Today, let's explore the Vision Transformer (ViT), a revolutionary approach in computer vision that's garnering significant attention!

manhwa

🕰️ The State of AI Before ViT's Emergence

  • In text analysis, word preprocessing, TF-IDF based DTM (Document-Term Matrix), BoW, and Word2Vec were widely used.
  • Then, the paper "Attention is All You Need" emerged, marking the beginning of Transformer-based innovation!
  • Various language models like BERT and GPT quickly appeared, leading to the rapid development of the text AI field.

But what about image analysis? It was still stuck with feeding pixel data into CNNs. Research built on ResNet, which appeared in 2015, was dominant, and these models were limited in how well they could capture the global relationships across an entire image.


💡 The Emergence of ViT!!

paper

What if we tokenize images like text and put them into an attention model?

  • Existing text Transformers analyzed sentences by treating each word as a token.
  • ViT divided the image into 16×16 patches, treating each patch as a token.
  • It introduced a method of classifying images (Classification) by inputting these patches into a Transformer.

🔍 Understanding the Detailed Structure of ViT

structure

1. Image Patching

  • Divides the input image into fixed-size patches.
  • Example: Splitting a 224×224 image into 16×16 patches generates a total of 196 patches.
  • Each of the 196 patches therefore holds 16 × 16 × 3 values (a color image has 3 RGB channels), as in the sketch below!
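To make the numbers concrete, here is a minimal sketch of the patching step, assuming PyTorch and a dummy 224×224 RGB input (the tensor names are placeholders, not from the paper):

```python
import torch

img = torch.randn(1, 3, 224, 224)   # dummy batch of one 224x224 RGB image
patch_size = 16

# Cut H and W into non-overlapping 16x16 windows, then flatten each window
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

print(patches.shape)   # torch.Size([1, 196, 768]) -> 196 patches of 16*16*3 = 768 values each
```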

2. Linear Projection

  • Transforms each patch into a fixed-dimensional vector through a linear layer.
  • Similar to creating word embeddings in text.
  • In the previous step, each of the 196 patches is converted into a 1×768 (768 = 16×16×3) one-dimensional vector, as sketched below!!
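Continuing the sketch above (again assuming PyTorch; the variable names are illustrative, and 768 matches ViT-Base), the projection is a single learnable linear layer:

```python
import torch
import torch.nn as nn

patch_dim = 16 * 16 * 3          # 768 raw pixel values per patch
embed_dim = 768                  # ViT-Base embedding size

proj = nn.Linear(patch_dim, embed_dim)      # learnable projection, like a word-embedding table

patches = torch.randn(1, 196, patch_dim)    # flattened patches from the previous step
tokens = proj(patches)                      # (1, 196, 768) patch embeddings
```

In practice this projection is often implemented as an equivalent `nn.Conv2d(3, 768, kernel_size=16, stride=16)` applied directly to the image.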

3. Positional Encoding - different models use different methods

Since Transformer does not recognize order, positional information is added to each patch to support learning the order between patches.

🧠 Key Summary

| Category | Description |
|:---|:---|
| Why needed? | Because Transformer cannot recognize order |
| How? | By adding positional information to patches |
| Method | Sine/cosine functions (fixed) / Learnable embeddings (ViT) |
| Result | The model can understand the relative and absolute positional information between patches |

📌 Why is it needed?

  • Transformer uses the Self-Attention mechanism, but it inherently lacks the ability to recognize the input order (patch order).
  • Therefore, information is needed to tell the model what the order is between the input tokens (whether text words or image patches).

Even in images, patches are not just an unordered pile; they carry meaning through their left-right, top-bottom, and surrounding relationships. Therefore, the model needs to know "where is this patch located?"

🛠️ How is it done?

  • A Positional Encoding vector is added to or combined with each patch vector to include it in the input.
  • There are two main methods:
    1. Static sine/cosine-based positional encoding
      • Generates patterns from fixed mathematical functions (sine, cosine).
      • Used in the Transformer paper "Attention Is All You Need".
    2. Learnable positional embeddings
      • Generates a vector for each position, and these vectors are learned together during the training process.
      • ViT primarily used learnable positional embeddings.

➡️ Thanks to this, the model can learn information about "where this patch is located," and Attention not only looks at values but also considers "spatial context!"
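A minimal sketch of the learnable variant that ViT uses, assuming PyTorch (in the full model the table has 197 rows because the [CLS] token introduced next also gets a position; 196 is used here to match the patches so far):

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768

# One learnable position vector per patch position (zeros here only for illustration)
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

tokens = torch.randn(1, num_patches, embed_dim)   # patch embeddings from the previous step
tokens = tokens + pos_embed                       # inject "where am I?" into every patch token
```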

4. Class Token: Adding a one-line summary in front of the patches!!!

A [CLS] token, representing the entire image, is added! So, if you split an image into 16×16 patches, a [CLS] token is placed in front of the 196 patches, making 197 tokens in total! This [CLS] token also consists of 768 elements, just like the other patches, right!? And it is this [CLS] token that will ultimately represent the classification result of the image!!

📌 What is it?

  • The [CLS] token (Classification Token) is a special token added to the beginning of the Transformer input sequence.
  • This [CLS] token is initialized with a learnable weight vector from the beginning.
  • Purpose: To contain summary information about the entire input image (or text), used to obtain the final classification result.

🛠️ How does it work?

  1. After dividing the input image patches, each patch is embedded.
  2. The [CLS] embedding vector is added to the front of this embedding sequence.
  3. The entire sequence ([CLS] + patches) is input into the Transformer encoder.
  4. After passing through the Transformer encoder,
    • The [CLS] token interacts with all patches centered on itself (Self-Attention) and gathers information.
  5. Finally, only the [CLS] token from the Transformer output is taken and put into the Classification Head (MLP Layer) to produce the final prediction value.

To put it simply: "[CLS] token = a summary reporter for this image." It's a structure where the model gathers the characteristics of the patches and summarizes them into one [CLS] token.
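As a rough sketch of that idea in PyTorch (names are placeholders):

```python
import torch
import torch.nn as nn

embed_dim = 768
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))     # learnable [CLS] embedding

patch_tokens = torch.randn(1, 196, embed_dim)              # embedded patches (+ positions)
cls = cls_token.expand(patch_tokens.shape[0], -1, -1)      # one [CLS] copy per image in the batch
sequence = torch.cat([cls, patch_tokens], dim=1)           # (1, 197, 768): [CLS] + 196 patches
```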


🎯 Why is it needed?

  • Although Transformer processes the entire sequence, the output also comes out separately for each token.
  • Since we need to know "what class does this entire input (image) belong to?", a single vector representing all the information is needed.
  • The [CLS] token serves exactly that purpose.

🧠 Characteristics of the [CLS] token in ViT

  • The [CLS] token starts with a learnable initial random value.
  • During training, it evolves into a vector that increasingly "summarizes the entire image" by giving and receiving Attention with the other patches.
  • This final vector can be used for image classification, feature extraction, and downstream tasks.

✨ Key Summary

| Category | Description |
|:---|:---|
| Role | Stores summary information representing the entire input (image) |
| Method | Add a [CLS] token before the input patches, then aggregate information through Attention |
| Result usage | Final classification result (connected to the Classification Head) |

5. Transformer Encoder

ViT's Transformer Encoder is the core module that naturally endows the model with the "ability to see the entire image"!!

🛠️ Transformer Encoder Block Components

A single Transformer Encoder block is made up of two main modules:

  1. Multi-Head Self-Attention (MSA)
  2. Multi-Layer Perceptron (MLP)

Layer Normalization and a Residual Connection always accompany each of these two blocks.

🧩 Detailed Structure Flow

  1. LayerNorm
    • Performs normalization on the input sequence (patches + [CLS]) first.
  2. Multi-Head Self-Attention (MSA)
    • Allows each token to interact with every other token.
    • Multiple Attention Heads operate in parallel to capture various relationships.
    • Learns the global context between all image patches through Self-Attention.
  3. Residual Connection
    • Adds the Attention result to the input.
    • Makes learning more stable and alleviates the Gradient Vanishing problem.
  4. LayerNorm
    • Performs normalization again.
  5. Multi-Layer Perceptron (MLP)
    • Contains two Linear Layers and an activation function (GELU) in between.
    • Transforms the feature representation of each token (patch) more complexly.
  6. Residual Connection
    • Adds the MLP result to the input.

🔄 Overall Block Flow

```
Input (Patch Sequence + [CLS])
  ↓ LayerNorm
  ↓ Multi-Head Self-Attention
  ↓ Residual Connection (Input + Attention Output)
  ↓ LayerNorm
  ↓ MLP (2 Linear + GELU)
  ↓ Residual Connection (Input + MLP Output)
Output (Passed to the next block or the final result)
```

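The block above maps almost line for line onto code. A minimal sketch, assuming PyTorch and ViT-Base dimensions (an illustration of the flow, not the reference implementation):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),   # expand
            nn.GELU(),                         # activation in between
            nn.Linear(dim * mlp_ratio, dim),   # project back
        )

    def forward(self, x):                  # x: (batch, 197, 768)
        h = self.norm1(x)                  # LayerNorm
        h, _ = self.attn(h, h, h)          # Multi-Head Self-Attention
        x = x + h                          # Residual Connection
        x = x + self.mlp(self.norm2(x))    # LayerNorm -> MLP -> Residual Connection
        return x

print(EncoderBlock()(torch.randn(1, 197, 768)).shape)   # torch.Size([1, 197, 768])
```

ViT-Base stacks 12 of these blocks back to back.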
---

#### 🎯 Key Role of the Transformer Encoder

| Component | Role |
|:---|:---|
| MSA (Self-Attention) | Learns the relationships between patches (captures global context) |
| MLP | Transforms the characteristics of each token non-linearly |
| LayerNorm | Improves learning stability |
| Residual Connection | Preserves information flow and stabilizes learning |

---

#### 🧠 Significance of the Transformer Encoder in ViT

- Unlike CNNs, which process images primarily through local patterns, the Transformer Encoder **models global patch-to-patch relationships all at once**.
- In particular, the [CLS] token learns to summarize the entire image during this process.

---

### 6. Classification Head

- Predicts the image class by passing the final [CLS] token through an **MLP classifier**.
- This is just like the classification stage of a CNN such as ResNet, right!? The 768-dimensional [CLS] vector goes through the MLP to produce the classification, as sketched below!!
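A minimal sketch of the head, assuming PyTorch and a hypothetical 1000-class problem (a final LayerNorm plus a linear layer, roughly how fine-tuned ViTs classify; the pre-training head in the paper has one extra hidden layer):

```python
import torch
import torch.nn as nn

num_classes = 1000
head = nn.Sequential(nn.LayerNorm(768), nn.Linear(768, num_classes))

encoder_out = torch.randn(1, 197, 768)   # output of the last encoder block
cls_out = encoder_out[:, 0]              # keep only the [CLS] token (position 0)
logits = head(cls_out)                   # (1, 1000) class scores; softmax gives probabilities
```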

---

## 🚀 Exploring the Inference Process of ViT!

> Let's summarize the flow of how a trained Vision Transformer model performs classification when a new image is input through the above process!

### 📸 1. Divide the image into patches

- The input image is divided into small pieces of **16×16 pixels**.
- For example, a 224×224 image is divided into a total of **196 patches**.

### 🔗 2. Combine CLS token with patch embeddings

- The **[CLS] embedding vector**, prepared during the training process, is added to the very beginning.
- Therefore, the input sequence consists of a total of **197 vectors**.
  (1 [CLS] + 196 patches)

### 🧠 3. Input to Transformer Encoder

- These 197 vectors are passed through the **Transformer Encoder**.
- The Transformer learns the relationships between each patch and the [CLS] through Self-Attention.
- The [CLS] comes to summarize the information of all patches.

### 🎯 4. Extract CLS token

- Only the **[CLS] token at the very beginning** of the Transformer output is extracted separately.
- This vector is a representation that aggregates the overall features of the input image.

### 🏁 5. Perform classification through MLP Head

- The final [CLS] vector is input into the **MLP Head** (Multi-Layer Perceptron).
- The MLP Head receives this vector and predicts **Class probabilities**.
  - (e.g., one of: cat, dog, car)

✅ Final ViT Inference Flow Summary

```
Image Input
  ↓
Divide into Patches
  ↓
Combine [CLS] + Patches
  ↓
Pass through Transformer Encoder
  ↓
Extract [CLS] Vector
  ↓
Classify with MLP Head
  ↓
Output Final Prediction Result!
```
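If you want to run this whole pipeline without writing the model yourself, here is a hedged usage sketch with torchvision's pretrained ViT-B/16 (assumes a recent torchvision with the weights enum API and an image of your own; `cat.jpg` is a placeholder):

```python
import torch
from PIL import Image
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights).eval()     # patching, [CLS], encoder, and head all inside
preprocess = weights.transforms()            # resize/center-crop to 224x224 + normalization

img = preprocess(Image.open("cat.jpg")).unsqueeze(0)   # (1, 3, 224, 224)
with torch.no_grad():
    logits = model(img)                      # (1, 1000) ImageNet class scores
print(weights.meta["categories"][logits.argmax().item()])
```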

The World Changed After ViT's Emergence!!

Importance of Large-Scale Data

  • ViT has weak inductive biases, so it can be weaker than CNNs on small datasets, but
  • It showed performance surpassing CNNs when pre-trained on hundreds of millions or billions of images.

Computational Efficiency

  • For very large models, the computational cost required for pre-training can be lower than that of CNNs.

Global Relationship Modeling

  • Naturally models long-range dependencies within the image through Self-Attention.
  • Since Self-Attention doesn't care about the distance between input tokens (= image patches) (distance is an important factor in CNNs!!),
  • All patches are directly connected to all other patches!!!
    • Patch #1 can directly ask Patch #196, "Is this important?"
    • It doesn't matter if they are 1, 10, or 100 pixels apart.
  • In other words, distant patches can directly influence each other! (See the sketch below.)
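A tiny sketch of the attention-score computation (pure tensor math over 197 tokens, with the query/key projections omitted for brevity) shows why distance plays no role: every token is scored against every other token in a single matrix product.

```python
import torch

tokens = torch.randn(1, 197, 768)      # [CLS] + 196 patch embeddings
q = k = tokens                         # the real model applies learned Q/K projections first

scores = q @ k.transpose(-2, -1) / 768 ** 0.5   # (1, 197, 197): every token vs. every token
attn = scores.softmax(dim=-1)

print(attn[0, 1, 196])   # weight patch #1 puts on patch #196 -- pixel distance never appears
```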

Interpretability

  • By visualizing the Attention Map (like CAM!!), it becomes possible to intuitively interpret which parts of the image the model is focusing on.
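As a rough sketch of how such a map is built: assuming you have already pulled the attention weights of the last encoder block out of the model (for example with a forward hook), with shape `(batch, heads, 197, 197)`, the [CLS] row reshaped onto the 14×14 patch grid is the map you would overlay on the image:

```python
import torch

attn = torch.rand(1, 12, 197, 197)              # placeholder for last-block attention weights
cls_to_patches = attn[0, :, 0, 1:]              # [CLS] -> each of the 196 patches, per head
attn_map = cls_to_patches.mean(dim=0).reshape(14, 14)   # average heads, back to the patch grid

# Upsample attn_map to 224x224 and blend it over the input image to see where the model looks.
```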

🌎 ViT's Impact

  • After ViTโ€™s success, research on Transformer architectures exploded in the field of computer vision.
  • Subsequently, various Vision Transformer family models (DeiT, Swin Transformer, etc.) emerged, showing excellent performance in various fields such as:
    • Image Recognition (Image Classification)
    • Object Detection
    • Image Segmentation
  • Various ViT-based models such as DINO and CLIP have emerged!!


Vision Transformer is not just one model; it was a true paradigm shift that "ushered in the Transformer era in the field of vision."



This post is licensed under CC BY 4.0 by the author.